--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




________________________________
From: Gao Pinker <xiaoping...@gmail.com>
To: java-dev@lucene.apache.org
Sent: Thursday, April 16, 2009 9:58:51 AM
Subject: I wanna contribute a Chinese analyzer to lucene

Hi All!

I wrote an Analyzer for Apache Lucene that analyzes sentences in Chinese. It's 
called imdict-chinese-analyzer, as it is a subproject of imdict, an intelligent 
online dictionary.

The project on Google Code is here: 
http://code.google.com/p/imdict-chinese-analyzer/

In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I), "是" (am), 
"中国人" (Chinese), not "我", "是中", "国人". So the analyzer must segment each 
sentence properly, or the index built by Lucene will be full of wrongly split 
words, and the accuracy of the search engine will suffer seriously!

Although there are two analyzer packages in the Apache repository which can 
handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or 
every two adjoining characters as a single word. This is obviously not how the 
language really works, and this strategy also increases the index size and 
hurts performance badly.
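
To make the difference concrete, here is a minimal sketch in plain Java (no 
Lucene dependency) of what character-based and two-character tokenization do 
to the example sentence. The class and method names are made up for 
illustration, and the two-character case assumes overlapping bigrams; it is 
not the actual code of ChineseAnalyzer or CJKAnalyzer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: splits text into overlapping
// n-grams of Unicode code points.
class NGramSketch {
    static List<String> ngrams(String text, int n) {
        int[] cps = text.codePoints().toArray();
        List<String> out = new ArrayList<>();
        // Slide a window of n code points over the text, one step at a time.
        for (int i = 0; i + n <= cps.length; i++) {
            out.add(new String(cps, i, n));
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "我是中国人";
        System.out.println(ngrams(s, 1)); // [我, 是, 中, 国, 人]
        System.out.println(ngrams(s, 2)); // [我是, 是中, 中国, 国人]
    }
}
```

Neither output contains "中国人" as a token, and the bigram output contains 
"是中" and "国人", which do not match the intended words of the sentence.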

The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model 
(HMM), so it can tokenize Chinese sentences in a really intelligent way. The 
tokenization accuracy of this model is above 90% according to the paper 
"HHMM-based Chinese Lexical Analyzer ICTCLAS".
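
As a rough illustration of probability-based segmentation, the sketch below 
picks the most probable split of a sentence with a Viterbi-style dynamic 
program over a word lattice. It is a simplified unigram model, not the HMM 
used by imdict-chinese-analyzer, and the lexicon, its probabilities, the 
unknown-character penalty and all names are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: most-probable segmentation under a toy unigram model.
class SegmentSketch {
    // Toy lexicon: word -> log-probability. All values are made up.
    static final Map<String, Double> LEXICON = new HashMap<>();
    static {
        LEXICON.put("我", Math.log(0.05));
        LEXICON.put("是", Math.log(0.05));
        LEXICON.put("中", Math.log(0.01));
        LEXICON.put("国", Math.log(0.01));
        LEXICON.put("人", Math.log(0.02));
        LEXICON.put("中国", Math.log(0.02));
        LEXICON.put("国人", Math.log(0.001));
        LEXICON.put("中国人", Math.log(0.01));
    }
    static final double UNKNOWN = Math.log(1e-8); // assumed penalty for unseen chars
    static final int MAX_WORD_LEN = 4;            // assumed longest lexicon entry

    // Works on char indices, so it assumes text within the Basic Multilingual Plane.
    static List<String> segment(String text) {
        int n = text.length();
        double[] best = new double[n + 1]; // best[i]: best log-prob of text[0, i)
        int[] prev = new int[n + 1];       // prev[i]: start of the last word ending at i
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;
        for (int end = 1; end <= n; end++) {
            for (int start = Math.max(0, end - MAX_WORD_LEN); start < end; start++) {
                Double lp = LEXICON.get(text.substring(start, end));
                if (lp == null) {
                    if (end - start > 1) continue; // unknown multi-char span: skip
                    lp = UNKNOWN;                  // unknown single char: allow, penalized
                }
                double score = best[start] + lp;
                if (score > best[end]) {
                    best[end] = score;
                    prev[end] = start;
                }
            }
        }
        // Walk the back-pointers from the end of the text to recover the words.
        List<String> words = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            words.add(text.substring(prev[end], end));
        }
        Collections.reverse(words);
        return words;
    }

    public static void main(String[] args) {
        System.out.println(segment("我是中国人")); // [我, 是, 中国人]
    }
}
```

With these toy numbers the split "我" "是" "中国人" scores higher than any split 
that uses "中国" + "人" or "中" + "国人", so the whole word is kept together.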

imdict-chinese-analyzer is a really fast, intelligent Chinese analyzer for 
Lucene written in Java, and I want to share this project with everyone using 
Lucene.

This Analyzer contains two packages: the source code and the lexical 
dictionary. I want to publish the source code under the Apache license, but the 
dictionary, which is under an ambiguous license, was not created by me.
So, can I submit only the source code to the Lucene contribution repository, 
and let users download the dictionary from the Google Code site?

Please help me with this contribution.
