[ 
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16436063#comment-16436063
 ] 

Robert Muir commented on LUCENE-8231:
-------------------------------------

I didn't really see consensus on this issue, though (there was some discussion 
about it on LUCENE-4065, and it seemed FilteringTokenFilter may be doing the 
right thing). I definitely think it's a concern unrelated to Korean, and we 
shouldn't put stopword filtering into our tokenizers until it's understood 
and discussed.
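
For readers who have not followed the other issue, here is a minimal sketch of what removing stopwords in a separate filter stage, rather than inside the tokenizer, can look like. It is a plain-Java model, not Lucene code: the Token class and the removeStopwords method are hypothetical names made up for this illustration. The one behavior borrowed from Lucene is that FilteringTokenFilter adds the position increments of the tokens it drops to the next token it keeps; whether that is "the right thing" is exactly the open question above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of a stop filter. Token is a hypothetical stand-in,
// not a Lucene class.
final class StopFilterSketch {

    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    // Drops stopwords and carries their position increments over to the
    // next kept token, so the gap they leave stays visible downstream.
    // Simplification: increments of trailing stopwords are discarded here.
    static List<Token> removeStopwords(List<Token> input, Set<String> stopwords) {
        List<Token> output = new ArrayList<>();
        int skipped = 0;
        for (Token token : input) {
            if (stopwords.contains(token.term)) {
                skipped += token.positionIncrement;
            } else {
                output.add(new Token(token.term, token.positionIncrement + skipped));
                skipped = 0;
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
                new Token("the", 1), new Token("quick", 1), new Token("and", 1),
                new Token("the", 1), new Token("dead", 1));
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "and"));
        // prints [quick(+2), dead(+3)]
        System.out.println(removeStopwords(tokens, stopwords));
    }
}
```

Keeping this logic out of the tokenizer means the tokenizer output stays complete, and the gap-handling policy can be changed in one place once it is agreed on.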

> Nori, a Korean analyzer based on mecab-ko-dic
> ---------------------------------------------
>
>                 Key: LUCENE-8231
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8231
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Jim Ferenczi
>            Priority: Major
>         Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, 
> LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC, but for Korean, called mecab-ko-dic.
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab and defines a format for the features 
> adapted for the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological 
> analysis (left cost + right cost + word cost) I tried to adapt the module to 
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the 
> Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make 
> some modifications to handle the differences from IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk. It's bigger than the 
> IPADIC one, mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation 
> that can be applied. 
> I attached the patch that contains this new Korean module called -godori- 
> nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first 
> and check the relevancy of the results. I don't speak Korean, so I used the 
> relevancy tests that were added for another Korean tokenizer 
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output 
> against mecab-ko, which is the official fork of MeCab that uses the 
> mecab-ko-dic.
> I had to simplify the JapaneseTokenizer: my version removes the nBest output 
> and the decomposition of overly long tokens. I also
> modified the handling of whitespaces since they are important in Korean. 
> Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the 
> token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described in the 
> mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespaces is also more in line with the official MeCab library, 
> which attaches the whitespaces to the term that follows.
> I also added a decompounder filter that expands the compounds and inflects 
> defined in the dictionary, and a part-of-speech filter, similar to the 
> Japanese one, that removes the morphemes that are not useful for relevance 
> (suffix, prefix, interjection, ...). These filters don't play well with the 
> tokenizer if it can output multiple paths (nBest output for instance), so for 
> simplicity I removed this ability and the Korean tokenizer only outputs the 
> best path.
> I compared the result with mecab-ko to confirm that the analyzer is working 
> and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the 
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising, so I plan to continue working on this 
> project. I started to extract the parts of the code that could be shared with 
> the Kuromoji module, but I wanted to share the status and this POC first to 
> confirm that this approach is viable. The advantages of using the same model 
> as the Japanese analyzer are multiple: we don't have a Korean analyzer at the 
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB), and the Tokenizer prunes the 
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the 
> following command:
> ant regenerate (you need to create the resource directory (mkdir 
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) 
> first since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the analysis.
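
The cost model mentioned in the quoted description (left cost + right cost + word cost, with the tokenizer selecting the best path through a lattice) can be illustrated with a small sketch. This is not code from the patch: the class LatticeSketch, the toy ASCII dictionary, the context ids, and every cost value below are made up for illustration. It assumes the usual MeCab-style reading, in which each entry carries a left context id, a right context id, and a word cost, and a matrix gives the cost of connecting the previous entry's right id to the next entry's left id. It also builds the full lattice rather than pruning on the fly, and it ignores any end-of-sentence connection cost.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy best-path segmentation over a lattice. All names, ids, and costs
// here are hypothetical and exist only for this illustration.
final class LatticeSketch {

    static final class Entry {
        final String surface;
        final int leftId;
        final int rightId;
        final int wordCost;

        Entry(String surface, int leftId, int rightId, int wordCost) {
            this.surface = surface;
            this.leftId = leftId;
            this.rightId = rightId;
            this.wordCost = wordCost;
        }
    }

    static final class Node {
        final Entry entry;      // null for the begin-of-sentence node
        final int totalCost;    // cheapest cost of any path ending at this node
        final Node previous;    // back pointer along that cheapest path

        Node(Entry entry, int totalCost, Node previous) {
            this.entry = entry;
            this.totalCost = totalCost;
            this.previous = previous;
        }
    }

    // Context id 0 is reserved for begin-of-sentence in this sketch.
    static final int[][] DEMO_CONNECTION_COST = {
        {0, 0, 0},
        {0, 1, 5},
        {0, 5, 20},
    };

    static List<Entry> demoDictionary() {
        return Arrays.asList(
                new Entry("a", 1, 1, 10),
                new Entry("bc", 1, 1, 10),
                new Entry("ab", 2, 2, 5),
                new Entry("c", 2, 2, 5),
                new Entry("abc", 1, 1, 40));
    }

    // Returns the cheapest segmentation of text, or an empty list if the
    // dictionary cannot cover it. Entries are assumed to be non-empty.
    static List<String> bestPath(String text, List<Entry> dictionary, int[][] connectionCost) {
        // endingAt.get(i) holds the nodes whose surface ends at offset i.
        List<List<Node>> endingAt = new ArrayList<>();
        for (int i = 0; i <= text.length(); i++) {
            endingAt.add(new ArrayList<>());
        }
        endingAt.get(0).add(new Node(null, 0, null));

        for (int pos = 0; pos < text.length(); pos++) {
            if (endingAt.get(pos).isEmpty()) {
                continue; // no path reaches this offset
            }
            for (Entry entry : dictionary) {
                if (!text.startsWith(entry.surface, pos)) {
                    continue;
                }
                Node best = null;
                int bestCost = Integer.MAX_VALUE;
                for (Node prev : endingAt.get(pos)) {
                    int prevRightId = prev.entry == null ? 0 : prev.entry.rightId;
                    int cost = prev.totalCost
                            + connectionCost[prevRightId][entry.leftId]
                            + entry.wordCost;
                    if (cost < bestCost) {
                        bestCost = cost;
                        best = prev;
                    }
                }
                endingAt.get(pos + entry.surface.length()).add(new Node(entry, bestCost, best));
            }
        }

        Node end = null;
        for (Node node : endingAt.get(text.length())) {
            if (end == null || node.totalCost < end.totalCost) {
                end = node;
            }
        }
        if (end == null) {
            return Collections.emptyList();
        }
        List<String> surfaces = new ArrayList<>();
        for (Node node = end; node.entry != null; node = node.previous) {
            surfaces.add(node.entry.surface);
        }
        Collections.reverse(surfaces);
        return surfaces;
    }

    public static void main(String[] args) {
        // prints [a, bc]
        System.out.println(bestPath("abc", demoDictionary(), DEMO_CONNECTION_COST));
    }
}
```

In this toy setup "ab" + "c" has the lower total word cost (10 versus 20), but the connection cost between the two entries (20 versus 1) makes "a" + "bc" the cheaper path overall (21 versus 30), which is the kind of trade-off the connection matrix is there to express.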



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

