[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uwe Schindler updated LUCENE-1629:
----------------------------------

    Attachment: build-resources-with-folder.patch

This is a second try, again with the resources folder. The src/resources folder is now optional: if it exists, all files inside it are copied to the build destination. The trick is that the copy task can additionally use a glob mapper, and thereby does the following:

- The source fileset of the copy task uses the src/ folder directly.
- The fileset includes only resources/**.
- Because the target folder would then get an additional "resources" sub-folder (the base dir of the copy operation is src/), the file names are rewritten by a glob mapper that strips "resources/" from the relative path.

(A rough sketch of the resulting copy task is appended at the end of this message.)

This patch also adds a simple test case showing that ArabicAnalyzer does not start correctly when the stopwords.txt file is not on the classpath. The test fails if the stopwords.txt file stays at its original location and/or the copy operation is commented out. (A sketch of such a test is also appended below.) The patch does not contain the deletion of the Arabic stopwords file from the sources folder (it was binary), so remove it by hand or simply move it after applying the patch.

> contrib intelligent Analyzer for Chinese
> ----------------------------------------
>
>                 Key: LUCENE-1629
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1629
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.4.1
>         Environment: for Java 1.5 or higher, Lucene 2.4.1
>            Reporter: Xiaoping Gao
>            Assignee: Michael McCandless
>             Fix For: 2.9
>
>         Attachments: analysis-data.zip, bigramdict.mem, build-resources-with-folder.patch, build-resources.patch, build-resources.patch, coredict.mem, LUCENE-1629-java1.4.patch
>
>
> I wrote an Analyzer for Apache Lucene that analyzes sentences in the Chinese language. It is called "imdict-chinese-analyzer"; the project on Google Code is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I) "是" (am) "中国人" (Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence properly, or there will be misunderstandings everywhere in the index constructed by Lucene, and the accuracy of the search engine will suffer seriously.
> Although there are two analyzer packages in the Apache repository that can handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or every two adjoining characters as a single word. This is obviously not true to reality, and this strategy also increases the index size and hurts performance badly.
> The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model (HMM), so it can tokenize Chinese sentences in a really intelligent way. The tokenization accuracy of this model is above 90% according to the paper "HHMM-based Chinese Lexical Analyzer ICTCLAS", while that of the other analyzers is about 60%.
> As imdict-chinese-analyzer is really fast and intelligent, I want to contribute it to the Apache Lucene repository.
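For reference, a minimal sketch of the copy trick described in the comment above. The destination path and the ${build.dir} property are assumptions; the actual target in the patch may differ:

    <copy todir="${build.dir}/classes/java">
      <!-- base dir is src/, which always exists; if src/resources is missing,
           the include pattern simply matches nothing and the copy is a no-op.
           That is what makes the resources folder optional. -->
      <fileset dir="src" includes="resources/**"/>
      <!-- strip the leading "resources/" so files land directly in the
           destination instead of a resources/ sub-folder -->
      <mapper type="glob" from="resources/*" to="*"/>
    </copy>

Note that the glob mapper's "*" matches across directory separators, so a nested file such as resources/org/foo/stopwords.txt keeps its package-relative path org/foo/stopwords.txt under the destination.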
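The test case for the ArabicAnalyzer resource loading could look roughly like this. This is only a sketch; the class and method names are assumptions, not necessarily what the patch adds, and it assumes the no-arg ArabicAnalyzer constructor that loads the default stopwords:

    import junit.framework.TestCase;
    import org.apache.lucene.analysis.ar.ArabicAnalyzer;

    public class TestArabicResources extends TestCase {
      /**
       * The no-arg constructor loads stopwords.txt as a classpath
       * resource, so constructing the analyzer only succeeds if the
       * resources were copied next to the compiled classes.
       */
      public void testStopwordsOnClasspath() throws Exception {
        new ArabicAnalyzer();
      }
    }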
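Finally, to illustrate the segmentation claim from the quoted issue description, a hedged usage sketch against the Lucene 2.4-era TokenStream API. The SmartChineseAnalyzer package and class name are an assumption here; substitute whatever the attached code actually calls its analyzer:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    // assumed package/class name; take the analyzer from the attached project
    import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;

    public class SegmentDemo {
      public static void main(String[] args) throws Exception {
        Analyzer analyzer = new SmartChineseAnalyzer();
        TokenStream ts = analyzer.tokenStream("text", new StringReader("我是中国人"));
        final Token reusable = new Token();
        for (Token t = ts.next(reusable); t != null; t = ts.next(reusable)) {
          // expected output: 我 / 是 / 中国人, not 我 / 是中 / 国人
          System.out.println(t.term());
        }
      }
    }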