I'd prefer it to stay 1.4 for now and would be willing to make the
change, if needed.
-- DM
On May 7, 2009, at 3:04 PM, Michael McCandless (JIRA) wrote:
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707042
#action_12707042 ]
Michael McCandless commented on LUCENE-1629:
--------------------------------------------
bq. There is a lot of code depending on Java 1.5; I use enums and
generics frequently, because I saw these points on the Apache wiki:
Well... "in general" contrib packages can be 1.5, but the analyzers
contrib package is widely used, and is not 1.5 now, so it's a
biggish change to force it to 1.5 with this. We should at least
discuss it separately on java-dev if we want to consider allowing 1.5
code into contrib-analyzers.
We could hold off on committing this until 3.0?
contrib intelligent Analyzer for Chinese
----------------------------------------
Key: LUCENE-1629
URL: https://issues.apache.org/jira/browse/LUCENE-1629
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/analyzers
Affects Versions: 2.4.1
Environment: Java 1.5 or higher, Lucene 2.4.1
Reporter: Xiaoping Gao
Attachments: analysis-data.zip, LUCENE-1629.patch
I wrote an Analyzer for Apache Lucene for analyzing sentences in the
Chinese language. It's called "imdict-chinese-analyzer"; the
project on Google Code is here: http://code.google.com/p/imdict-chinese-analyzer/
In Chinese, "我是中国人" (I am Chinese) should be tokenized as
"我" (I) "是" (am) "中国人" (Chinese), not "我" "是中" "国人".
So the analyzer must segment each sentence properly, or there
will be misunderstandings everywhere in the index constructed by
Lucene, and the accuracy of the search engine will be seriously
affected!
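For reference, here is a minimal sketch of what consuming such an
analyzer would look like with the Lucene 2.4 TokenStream API. The
analyzer class name (ImdictChineseAnalyzer) is only a placeholder for
whatever class the attached patch actually provides:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class SegmentationDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder name; substitute the analyzer class supplied by the patch.
        Analyzer analyzer = new ImdictChineseAnalyzer();
        TokenStream ts = analyzer.tokenStream("content",
                new StringReader("我是中国人"));
        Token token;
        while ((token = ts.next()) != null) {
            // Expected output for a word-level segmenter: 我 / 是 / 中国人
            System.out.println(new String(token.termBuffer(), 0, token.termLength()));
        }
    }
}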
Although there are two analyzer packages in the Apache repository that
can handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each
single character or every two adjoining characters as a word. This is
obviously not how Chinese actually works, and this strategy also
increases the index size and hurts performance badly.
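To make the contrast concrete, here is a hedged sketch of the existing
contrib analyzers applied to the same sentence; the tokens in the
comments reflect their unigram and bigram behavior (same Lucene 2.4
TokenStream loop as above):

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.cn.ChineseAnalyzer;

public class ExistingAnalyzersDemo {
    public static void main(String[] args) throws Exception {
        String text = "我是中国人";
        // ChineseAnalyzer: one token per Chinese character -> 我 / 是 / 中 / 国 / 人
        print(new ChineseAnalyzer().tokenStream("content", new StringReader(text)));
        // CJKAnalyzer: overlapping bigrams -> 我是 / 是中 / 中国 / 国人
        print(new CJKAnalyzer().tokenStream("content", new StringReader(text)));
    }

    static void print(TokenStream ts) throws Exception {
        Token token;
        while ((token = ts.next()) != null) {
            System.out.println(new String(token.termBuffer(), 0, token.termLength()));
        }
    }
}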
The algorithm of imdict-chinese-analyzer is based on a Hidden Markov
Model (HMM), so it can tokenize Chinese sentences in a really
intelligent way. The tokenization accuracy of this model is above 90%
according to the paper "HHMM-based Chinese Lexical Analyzer
ICTCLAS", while the other analyzers' is about 60%.
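For anyone not familiar with this family of segmenters, here is a
heavily simplified sketch of the core idea: choose the segmentation
that maximizes the probability of the word sequence using dynamic
programming. The real analyzer uses a hierarchical HMM with trained
model data (the attached analysis-data.zip); the toy dictionary and
probabilities below are made up purely for illustration:

import java.util.*;

public class ToySegmenter {
    // Toy unigram log-probabilities; the real analyzer loads trained HMM data.
    static final Map<String, Double> DICT = new HashMap<String, Double>();
    static {
        DICT.put("我", Math.log(0.05));
        DICT.put("是", Math.log(0.05));
        DICT.put("中", Math.log(0.01));
        DICT.put("国", Math.log(0.01));
        DICT.put("人", Math.log(0.02));
        DICT.put("中国", Math.log(0.03));
        DICT.put("中国人", Math.log(0.02));
    }

    // best[i] = best log-probability of any segmentation of the prefix text[0..i)
    public static List<String> segment(String text) {
        int n = text.length();
        double[] best = new double[n + 1];
        int[] prev = new int[n + 1];
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = Math.max(0, i - 4); j < i; j++) {  // assume words of at most 4 chars
                Double p = DICT.get(text.substring(j, i));
                if (p == null) p = Math.log(1e-8);          // penalty for unknown words
                if (best[j] + p > best[i]) {
                    best[i] = best[j] + p;
                    prev[i] = j;
                }
            }
        }
        LinkedList<String> words = new LinkedList<String>();
        for (int i = n; i > 0; i = prev[i]) {
            words.addFirst(text.substring(prev[i], i));
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints [我, 是, 中国人] with the toy dictionary above.
        System.out.println(segment("我是中国人"));
    }
}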
As imdict-chinese-analyzer is really fast and intelligent, I want
to contribute it to the Apache Lucene repository.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org