I'd prefer it to stay 1.4 for now and would be willing to make the
change, if needed.
-- DM
On May 7, 2009, at 3:04 PM, Michael McCandless (JIRA) wrote:
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707042
#action_12707042 ]
Michael McCandless commented on LUCENE-1629:
--------------------------------------------
bq. There is a lot of code depending on Java 1.5; I use enums and
generics frequently, because I saw these points on the Apache wiki:
Well... "in general" contrib packages can be 1.5, but the analyzers
contrib package is widely used, and is not 1.5 now, so it's a
biggish change to force it to 1.5 with this. We should at least
discuss it separately on java-dev if we want to consider allowing 1.5
code into contrib-analyzers.
We could hold off on committing this until 3.0?
contrib intelligent Analyzer for Chinese
----------------------------------------
Key: LUCENE-1629
URL: https://issues.apache.org/jira/browse/LUCENE-1629
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/analyzers
Affects Versions: 2.4.1
Environment: Java 1.5 or higher, Lucene 2.4.1
Reporter: Xiaoping Gao
Attachments: analysis-data.zip, LUCENE-1629.patch
I wrote an Analyzer for Apache Lucene for analyzing sentences in the
Chinese language. It's called "imdict-chinese-analyzer"; the
project on Google Code is here: http://code.google.com/p/imdict-chinese-analyzer/
In Chinese, "我是中国人" (I am Chinese) should be tokenized as
"我" (I) "是" (am) "中国人" (Chinese), not "我" "是中" "国人".
So the analyzer must segment each sentence properly, or there
will be misunderstandings everywhere in the index constructed by
Lucene, and the accuracy of the search engine will be seriously
affected!
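For reference, here is a minimal sketch of what consuming such an
analyzer would look like with the Lucene 2.4 TokenStream API. The
analyzer class name (ImdictChineseAnalyzer) is only a placeholder for
whatever class the attached patch actually provides:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class SegmentationDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder name; substitute the analyzer class supplied by the patch.
        Analyzer analyzer = new ImdictChineseAnalyzer();
        TokenStream ts = analyzer.tokenStream("content",
                new StringReader("我是中国人"));
        Token token;
        while ((token = ts.next()) != null) {
            // Expected output for a word-level segmenter: 我 / 是 / 中国人
            System.out.println(new String(token.termBuffer(), 0, token.termLength()));
        }
    }
}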
Although there are two analyzer packages in the Apache repository that
can handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each
single character or every two adjoining characters as a word. This is
obviously not how Chinese actually works, and this strategy also
increases the index size and hurts performance badly.
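To make the contrast concrete, here is a hedged sketch of the existing
contrib analyzers applied to the same sentence; the tokens in the
comments reflect their unigram and bigram behavior (same Lucene 2.4
TokenStream loop as above):

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.cn.ChineseAnalyzer;

public class ExistingAnalyzersDemo {
    public static void main(String[] args) throws Exception {
        String text = "我是中国人";
        // ChineseAnalyzer: one token per Chinese character -> 我 / 是 / 中 / 国 / 人
        print(new ChineseAnalyzer().tokenStream("content", new StringReader(text)));
        // CJKAnalyzer: overlapping bigrams -> 我是 / 是中 / 中国 / 国人
        print(new CJKAnalyzer().tokenStream("content", new StringReader(text)));
    }

    static void print(TokenStream ts) throws Exception {
        Token token;
        while ((token = ts.next()) != null) {
            System.out.println(new String(token.termBuffer(), 0, token.termLength()));
        }
    }
}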
The algorithm of imdict-chinese-analyzer is based on a Hidden Markov
Model (HMM), so it can tokenize Chinese sentences in a really
intelligent way. The tokenization accuracy of this model is above 90%
according to the paper "HHMM-based Chinese Lexical Analyzer
ICTCLAS", while the other analyzers' is about 60%.
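For anyone not familiar with this family of segmenters, here is a
heavily simplified sketch of the core idea: choose the segmentation
that maximizes the probability of the word sequence using dynamic
programming. The real analyzer uses a hierarchical HMM with trained
model data (the attached analysis-data.zip); the toy dictionary and
probabilities below are made up purely for illustration:

import java.util.*;

public class ToySegmenter {
    // Toy unigram log-probabilities; the real analyzer loads trained HMM data.
    static final Map<String, Double> DICT = new HashMap<String, Double>();
    static {
        DICT.put("我", Math.log(0.05));
        DICT.put("是", Math.log(0.05));
        DICT.put("中", Math.log(0.01));
        DICT.put("国", Math.log(0.01));
        DICT.put("人", Math.log(0.02));
        DICT.put("中国", Math.log(0.03));
        DICT.put("中国人", Math.log(0.02));
    }

    // best[i] = best log-probability of any segmentation of the prefix text[0..i)
    public static List<String> segment(String text) {
        int n = text.length();
        double[] best = new double[n + 1];
        int[] prev = new int[n + 1];
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = Math.max(0, i - 4); j < i; j++) {  // assume words of at most 4 chars
                Double p = DICT.get(text.substring(j, i));
                if (p == null) p = Math.log(1e-8);          // penalty for unknown words
                if (best[j] + p > best[i]) {
                    best[i] = best[j] + p;
                    prev[i] = j;
                }
            }
        }
        LinkedList<String> words = new LinkedList<String>();
        for (int i = n; i > 0; i = prev[i]) {
            words.addFirst(text.substring(prev[i], i));
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints [我, 是, 中国人] with the toy dictionary above.
        System.out.println(segment("我是中国人"));
    }
}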
As imdict-chinese-analyzer is really fast and intelligent, I want
to contribute it to the Apache Lucene repository.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org