Hi All!

I wrote an Analyzer for Apache Lucene that analyzes sentences in the
*Chinese* language. It is called
*imdict-chinese-analyzer* because it is a subproject of
*imdict* <http://www.imdict.net/>,
which is an intelligent online dictionary.

The project on Google Code is here:
http://code.google.com/p/imdict-chinese-analyzer/

In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I), "是" (am),
"中国人" (Chinese), *not* as "我", "是中", "国人". The analyzer must therefore
segment each sentence correctly. Otherwise the index built by Lucene will be
full of terms that are not real words, and the accuracy of the search engine
will suffer seriously.
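
To make the effect on search concrete, here is a minimal sketch in plain Java.
It does not use Lucene at all: the "index" is just a set of terms, and the
class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SegmentationAndSearch {
    // Stand-in for an index: a query term matches only if exactly that
    // term was produced when the sentence was tokenized.
    static boolean matches(List<String> indexedTokens, String queryTerm) {
        Set<String> terms = new HashSet<>(indexedTokens);
        return terms.contains(queryTerm);
    }

    public static void main(String[] args) {
        List<String> good = Arrays.asList("我", "是", "中国人");
        List<String> bad = Arrays.asList("我", "是中", "国人");

        System.out.println(matches(good, "中国人")); // true
        System.out.println(matches(bad, "中国人"));  // false
    }
}
```

With the wrong segmentation, the word "中国人" never becomes a term, so a search
for it cannot find the sentence.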

There are already two analyzer packages in the Apache repository that can
handle Chinese:
ChineseAnalyzer <http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cn/> and
CJKAnalyzer <http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cjk/>.
However, they treat each character, or every two adjacent characters, as a
single word. This obviously does not match how Chinese words are really
formed, and it also increases the index size and hurts performance badly.
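
As a rough illustration of the per-character and two-character strategies,
here is a small sketch in plain Java. It is not the code of either analyzer,
the names are invented for this example, and it assumes every character fits
in one Java `char` (true for the sample sentence).

```java
import java.util.ArrayList;
import java.util.List;

class NaiveCjkTokens {
    // One token per character.
    static List<String> unigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            out.add(text.substring(i, i + 1));
        }
        return out;
    }

    // One token per pair of adjacent characters, with overlap.
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "我是中国人";
        System.out.println(unigrams(s)); // [我, 是, 中, 国, 人]
        System.out.println(bigrams(s));  // [我是, 是中, 中国, 国人]
    }
}
```

Neither output contains "中国人", and both produce more tokens (5 and 4) than the
3 real words in the sentence.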

The algorithm of *imdict-chinese-analyzer* is based on the Hidden Markov Model
(HMM), so it can tokenize Chinese sentences in a really intelligent way.
The tokenization accuracy of this model is above 90% according to the paper
"HHMM-based Chinese Lexical Analyzer ICTCLAS"
<http://www.nlp.org.cn/project/project.php?proj_id=6>.

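To give a feeling for what "choosing the most probable segmentation" means,
here is a much simplified sketch. It is *not* the analyzer's actual algorithm
or dictionary: it only uses Viterbi-style dynamic programming over a toy
dictionary, and every probability and name in it is invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ToyProbabilisticSegmenter {
    // Hypothetical word probabilities, made up for this example only.
    static final Map<String, Double> PROB = new HashMap<>();
    static {
        PROB.put("我", 0.05);
        PROB.put("是", 0.05);
        PROB.put("中", 0.01);
        PROB.put("国", 0.01);
        PROB.put("人", 0.02);
        PROB.put("中国", 0.02);
        PROB.put("国人", 0.001);
        PROB.put("中国人", 0.01);
    }
    static final int MAX_WORD_LEN = 3;
    static final double UNKNOWN_CHAR_PROB = 1e-6;

    static List<String> segment(String text) {
        int n = text.length();
        double[] best = new double[n + 1]; // best[i]: best log-probability of text[0..i)
        int[] prev = new int[n + 1];       // prev[i]: start of the last word on that path
        for (int end = 1; end <= n; end++) {
            best[end] = Double.NEGATIVE_INFINITY;
            for (int start = Math.max(0, end - MAX_WORD_LEN); start < end; start++) {
                String word = text.substring(start, end);
                Double p = PROB.get(word);
                if (p == null) {
                    if (word.length() > 1) continue; // unknown strings are not words
                    p = UNKNOWN_CHAR_PROB;           // unknown single characters stay possible
                }
                double score = best[start] + Math.log(p);
                if (score > best[end]) {
                    best[end] = score;
                    prev[end] = start;
                }
            }
        }
        // Walk back from the end of the sentence to recover the best path.
        List<String> words = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            words.add(text.substring(prev[end], end));
        }
        Collections.reverse(words);
        return words;
    }

    public static void main(String[] args) {
        System.out.println(segment("我是中国人")); // [我, 是, 中国人]
    }
}
```

With these toy numbers, "我" "是" "中国人" scores higher than any other split, for
example "我" "是" "中国" "人", so it is the one that is returned.
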
As *imdict-chinese-analyzer* is a really fast, intelligent Chinese Analyzer
for Lucene written in Java, I want to share this project with everyone
using Lucene.

This Analyzer consists of two parts, *the source code* and the *lexical
dictionary*. I want to publish the source code under the Apache license, but
the dictionary, which is under an ambiguous license, was not created by me.
So, can I submit only the source code to the Lucene contrib repository, and
let users download the dictionary from the Google Code site?

Please help me with this contribution.
