[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

Uwe Schindler (JIRA) Thu, 07 May 2009 06:34:54 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706887#action_12706887
 ]


Uwe Schindler commented on LUCENE-1629:
---------------------------------------

Hi Xiaoping,

looks good, but I have some suggestions:
- Making the data file readable only through a RandomAccessFile makes it hard 
to, e.g., ship the data file inside the JAR file. I would like to put the 
data file directly into the package directory and load it with 
Class.getResourceAsStream(). In this case, the binary Lucene analyzer JAR would 
be ready to use and the analyzer would run out of the box. Configuring 
external files is often complicated, e.g. in web applications. A mechanism 
that loads the file automatically from the JAR would be fine.
- I have seen some singleton implementations where the static getInstance() 
method is not synchronized. Without synchronization, more than one instance 
may be created if different threads call getInstance() at the same time or 
close together.
- Do we compile the source files with a fixed encoding of UTF-8 (in build.xml)? 
If not, there may be problems if the Java compiler uses another encoding 
(the platform default).
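
A minimal sketch of the first two suggestions, not the code from the patch: 
the class name DictionaryData and the resource name "coredict.dat" are 
hypothetical and only stand in for whatever the analyzer actually uses. The 
data is read sequentially from an InputStream, so it can come from the JAR, 
and getInstance() is synchronized.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical holder for the dictionary data (not the patch's actual class). */
final class DictionaryData {
    // Hypothetical resource name; a relative name is resolved against
    // the package of this class, i.e. the file sits next to the .class file.
    static final String RESOURCE = "coredict.dat";

    private static DictionaryData instance;

    private final byte[] bytes;

    private DictionaryData(byte[] bytes) {
        this.bytes = bytes;
    }

    /** Reads the whole stream sequentially; no RandomAccessFile is needed. */
    static DictionaryData load(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new DictionaryData(out.toByteArray());
    }

    /** synchronized, so concurrent first calls cannot create two instances. */
    static synchronized DictionaryData getInstance() throws IOException {
        if (instance == null) {
            InputStream in = DictionaryData.class.getResourceAsStream(RESOURCE);
            if (in == null) {
                throw new IOException("Resource not found on classpath: " + RESOURCE);
            }
            try {
                instance = load(in);
            } finally {
                in.close();
            }
        }
        return instance;
    }

    /** Number of bytes loaded. */
    int size() {
        return bytes.length;
    }
}
```

For the third point, Ant's javac task accepts an encoding attribute (for 
example encoding="UTF-8"), which would pin the source encoding independently 
of the platform default.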

> contrib intelligent Analyzer for Chinese
> ----------------------------------------
>
>                 Key: LUCENE-1629
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1629
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.4.1
>         Environment: for java 1.5 or higher, lucene 2.4.1
>            Reporter: Xiaoping Gao
>         Attachments: analysis-data.zip, LUCENE-1629.patch
>
>
> I wrote an Analyzer for Apache Lucene for analyzing sentences in the Chinese 
> language. It's called "imdict-chinese-analyzer"; the project on Google Code 
> is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I), "是" (am), 
> "中国人" (Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence 
> properly, or there will be misunderstandings everywhere in the index 
> constructed by Lucene, and the accuracy of the search engine will be 
> seriously affected!
> Although there are two analyzer packages in the Apache repository which can 
> handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or 
> every two adjoining characters as a single word. This is obviously not true 
> in reality; this strategy also increases the index size and hurts 
> performance badly.
> The algorithm of imdict-chinese-analyzer is based on a Hidden Markov Model 
> (HMM), so it can tokenize Chinese sentences in a really intelligent way. 
> The tokenization accuracy of this model is above 90% according to the paper 
> "HHMM-based Chinese Lexical Analyzer ICTCLAS", while that of other analyzers 
> is about 60%.
> As imdict-chinese-analyzer is really fast and intelligent, I want to 
> contribute it to the Apache Lucene repository.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


