[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707969#action_12707969 ]
Michael McCandless commented on LUCENE-1629:
--------------------------------------------

When I run "ant test" in contrib/analyzers, SmartChineseAnalyzer is unable to locate stopwords.txt:

{code}
    [junit] Testcase: testChineseAnalyzer(org.apache.lucene.analysis.cn.TestSmartChineseAnalyzer):  Caused an ERROR
    [junit] null
    [junit] java.lang.NullPointerException
    [junit]     at java.io.Reader.<init>(Reader.java:61)
    [junit]     at java.io.InputStreamReader.<init>(InputStreamReader.java:80)
    [junit]     at org.apache.lucene.analysis.cn.SmartChineseAnalyzer.loadStopWords(SmartChineseAnalyzer.java:112)
    [junit]     at org.apache.lucene.analysis.cn.SmartChineseAnalyzer.<init>(SmartChineseAnalyzer.java:71)
    [junit]     at org.apache.lucene.analysis.cn.TestSmartChineseAnalyzer.testChineseAnalyzer(TestSmartChineseAnalyzer.java:36)
{code}

(A defensive null check would make this failure clearer; see the sketch at the end of this message.)

> contrib intelligent Analyzer for Chinese
> ----------------------------------------
>
>                 Key: LUCENE-1629
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1629
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.4.1
>         Environment: Java 1.5 or higher, Lucene 2.4.1
>            Reporter: Xiaoping Gao
>            Assignee: Michael McCandless
>             Fix For: 2.9
>
>         Attachments: analysis-data.zip, bigramdict.mem, coredict.mem, LUCENE-1629-java1.4.patch
>
>
> I wrote an Analyzer for Apache Lucene that analyzes sentences in the Chinese language. It is called "imdict-chinese-analyzer"; the project on Google Code is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I), "是" (am), "中国人" (Chinese), not as "我" "是中" "国人". The analyzer must handle each sentence properly, or there will be misunderstandings everywhere in the index constructed by Lucene, and the accuracy of the search engine will be seriously affected.
> Although there are two analyzer packages in the Apache repository that can handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each single character or every two adjoining characters as a word. This is obviously not true to reality, and this strategy also increases the index size and hurts performance badly.
> The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model (HMM), so it can tokenize Chinese sentences in a really intelligent way. The tokenization accuracy of this model is above 90% according to the paper "HHMM-based Chinese Lexical Analyzer ICTCLAS", while that of the other analyzers is about 60%.
> As imdict-chinese-analyzer is really fast and intelligent, I want to contribute it to the Apache Lucene repository.
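To make the reporter's expected behavior concrete: a minimal sketch of driving the analyzer over the example sentence, assuming the Lucene 2.9-era TokenStream attribute API and a no-argument SmartChineseAnalyzer constructor (as implied by the stack trace above).

{code}
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class SegmentDemo {
    public static void main(String[] args) throws Exception {
        // Word-level segmentation: should print 我, 是, 中国人,
        // rather than unigrams or overlapping bigrams.
        SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer();
        TokenStream ts = analyzer.tokenStream("content", new StringReader("我是中国人"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            System.out.println(term.term());
        }
        ts.close();
    }
}
{code}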
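On the NPE itself: InputStreamReader throws a NullPointerException when handed a null stream, which is what getResourceAsStream returns when stopwords.txt is not on the classpath (e.g. the build did not copy it next to the classes). A minimal defensive sketch; the class and method names here are illustrative, not the actual patch:

{code}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

public class StopWordLoader {
    // Fail with a descriptive IOException instead of an NPE
    // when the stopword resource is missing from the classpath.
    public static Set<String> loadStopWords(Class<?> clazz, String resource) throws IOException {
        InputStream in = clazz.getResourceAsStream(resource);
        if (in == null) {
            throw new IOException("Stopword resource not found on classpath: " + resource);
        }
        Set<String> stopWords = new HashSet<String>();
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.length() > 0) {
                    stopWords.add(line);
                }
            }
        } finally {
            reader.close();
        }
        return stopWords;
    }
}
{code}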