[ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709425#action_12709425
 ] 

Uwe Schindler commented on LUCENE-1629:
---------------------------------------

Hi Xiaoping,

Thanks! The code is now committed.

Just to make sure I understand (as I do not know Chinese and cannot read some of 
the comments), a few questions/comments:
The .mem files are serializations of the dictionaries. They are created by 
loading the dictionaries from the random-access files (the .dct files) and then 
serializing them to the .mem files. For developers and future updates, however, 
you still need the .dct files and have to rerun these steps (all those private 
methods).
An interesting addition would be a custom build step that takes the .dct files 
and builds the .mem files from them. How would I invoke that? So maybe you 
could extract the currently unused .dct file loaders from the analyzer classes 
and turn them into a separate tool, invokable from ant, that builds the .mem 
files.
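A minimal sketch of what such a standalone converter could look like (the class 
name, the map layout, and the .dct parsing are placeholders, not the analyzer's 
actual code):

```java
import java.io.*;
import java.util.HashMap;

// Hypothetical standalone converter: reads a binary .dct dictionary file
// and writes the serialized .mem form the analyzer deserializes at runtime.
public class BuildDictTool {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: BuildDictTool <input.dct> <output.mem>");
            System.exit(1);
        }
        convert(new File(args[0]), new File(args[1]));
    }

    static void convert(File dct, File mem) throws IOException {
        HashMap<String, int[]> dict = loadDct(dct);
        ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(mem)));
        try {
            out.writeObject(dict);   // the .mem file is just a serialized map
        } finally {
            out.close();
        }
    }

    // Placeholder for parsing the random-access .dct records; the real
    // loading code currently lives in the private methods of the analyzer.
    static HashMap<String, int[]> loadDct(File dct) throws IOException {
        return new HashMap<String, int[]>();
    }
}
```

From ant, such a tool could then be run with a plain <java> task in build.xml, 
so the .mem files are rebuilt whenever the .dct files change.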

Uwe

P.S.: By the way: in those private conversion methods (which are never called 
from the library code) you have the default try-catch blocks, which is bad 
programming practice. The proposed separate conversion tool should handle the 
exceptions correctly, or better, not catch them at all and let them propagate 
(side note: I hate Eclipse for generating these auto-catch blocks; it would be 
better if it auto-added throws clauses to the method signatures!)
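To illustrate the point with a generic, hypothetical file-reading example (not 
code from the patch):

```java
import java.io.*;

public class ExceptionStyle {

    // Anti-pattern: the IDE-generated catch swallows the failure, and the
    // caller silently gets an empty result instead of an error.
    static byte[] readSwallowing(File f) {
        try {
            FileInputStream in = new FileInputStream(f);
            byte[] buf = new byte[(int) f.length()];
            in.read(buf);   // also sloppy: a single read() may not fill buf
            in.close();
            return buf;
        } catch (IOException e) {
            e.printStackTrace();   // auto-generated; error effectively ignored
            return new byte[0];
        }
    }

    // Preferred: declare the exception and let the caller decide what to do.
    static byte[] readDeclaring(File f) throws IOException {
        FileInputStream in = new FileInputStream(f);
        try {
            byte[] buf = new byte[(int) f.length()];
            int off = 0;
            while (off < buf.length) {
                int n = in.read(buf, off, buf.length - off);
                if (n < 0) throw new EOFException();
                off += n;
            }
            return buf;
        } finally {
            in.close();
        }
    }
}
```

With the throws clause, a broken dictionary file fails the build loudly instead 
of producing a silently empty .mem file.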

> contrib intelligent Analyzer for Chinese
> ----------------------------------------
>
>                 Key: LUCENE-1629
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1629
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.4.1
>         Environment: for java 1.5 or higher, lucene 2.4.1
>            Reporter: Xiaoping Gao
>            Assignee: Michael McCandless
>             Fix For: 2.9
>
>         Attachments: analysis-data.zip, bigramdict.mem, 
> build-resources-with-folder.patch, build-resources.patch, 
> build-resources.patch, coredict.mem, LUCENE-1629-encoding-fix.patch, 
> LUCENE-1629-java1.4.patch
>
>
> I wrote an Analyzer for Apache Lucene for analyzing sentences in the Chinese 
> language. It's called "imdict-chinese-analyzer"; the project on Google Code 
> is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I), "是" (am), 
> "中国人" (Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence 
> properly, or there will be misunderstandings everywhere in the index 
> constructed by Lucene, and the accuracy of the search engine will be 
> seriously affected!
> Although there are two analyzer packages in the Apache repository that can 
> handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each single 
> character or every two adjoining characters as a word. This is obviously not 
> how Chinese works in reality; this strategy also increases the index size 
> and hurts performance badly.
> The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model 
> (HMM), so it can tokenize Chinese sentences in a really intelligent way. 
> The tokenization accuracy of this model is above 90% according to the paper 
> "HHMM-based Chinese Lexical Analyzer ICTCLAS", while other analyzers' is 
> about 60%.
> As imdict-chinese-analyzer is really fast and intelligent, I want to 
> contribute it to the Apache Lucene repository.
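The segmentation difference described in the issue can be sketched as a toy 
comparison (hypothetical code, from neither analyzer: the real 
imdict-chinese-analyzer picks the most probable segmentation with an HMM, not 
a greedy dictionary match):

```java
import java.util.*;

public class SegmentDemo {

    // CJKAnalyzer-style tokenization: every two adjoining characters.
    static List<String> bigrams(String s) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i + 1 < s.length(); i++) {
            tokens.add(s.substring(i, i + 2));
        }
        return tokens;
    }

    // Crude greedy longest-match against a tiny word list, just to show
    // word-aware output; unmatched characters become single-char tokens.
    static List<String> dictSegment(String s, Set<String> words) {
        List<String> tokens = new ArrayList<String>();
        int i = 0;
        while (i < s.length()) {
            int len = 1;
            for (int j = s.length(); j > i + 1; j--) {
                if (words.contains(s.substring(i, j))) { len = j - i; break; }
            }
            tokens.add(s.substring(i, i + len));
            i += len;
        }
        return tokens;
    }
}
```

On "我是中国人", the bigram approach emits "我是", "是中", "中国", "国人", while the 
word-aware approach emits "我", "是", "中国人", matching the tokenization the 
reporter describes.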

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
