[
https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418690#comment-16418690
]
Jim Ferenczi commented on LUCENE-8231:
--------------------------------------
Thanks for looking, Robert!
{quote}
Should there be a ReadingFormFilter similar to the kuromoji case?
{quote}
I attached a new patch that adds this filter; the readings were already in the
binary dictionary so this does not change the size of the jar.
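For reference, here is a minimal sketch of what such a filter could look like
(KoreanReadingAttribute is an assumption of mine, mirroring kuromoji's
ReadingAttribute; the names may not match what is in the patch):
{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Sketch only: replaces the surface form with the reading from the dictionary.
// KoreanReadingAttribute is hypothetical, the Korean counterpart of kuromoji's
// ReadingAttribute.
public final class KoreanReadingFormFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final KoreanReadingAttribute readingAtt = addAttribute(KoreanReadingAttribute.class);

  public KoreanReadingFormFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    String reading = readingAtt.getReading();
    if (reading != null) {
      termAtt.setEmpty().append(reading); // keep the surface form when no reading is defined
    }
    return true;
  }
}
{code}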
{quote}
In the kuromoji case there is a lot of Japanese-specific compression that
Uwe and I did, if you are worried about size/ram we can try to shrink this
korean data in ways that make sense for it. That can really be a
followup/polish/nice-to-have: how big is the built JAR now? something
semi-reasonable?
{quote}
I think the size is reasonable, especially if you compare with other libraries
that use the mecab-ko-dic ;).
There is still some room for improvement though. I did not add the semantic
class of the token, but we could do the same as for the Japanese dictionary,
where the POS tags are added in a separate file. The semantic class + POS is
unique per leftId, so this could also save 1 byte per term in the binary
dictionary (we currently use 1 byte per POS per term in the main dictionary).
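Roughly, the idea would be something like this (a sketch only; the class and
field names are made up, not code from the patch):
{code:java}
// Sketch: since the semantic class + POS is unique per leftId, it could live in
// a small side table loaded once from a separate resource (like the Japanese
// dictionary does for POS) instead of costing one byte per term.
final class PosSideTable {
  private final String[] posByLeftId;   // indexed by leftId, loaded from a separate file

  PosSideTable(String[] posByLeftId) {
    this.posByLeftId = posByLeftId;
  }

  String getPartOfSpeech(int leftId) {
    return posByLeftId[leftId];   // no per-term POS byte needed in the main dictionary
  }
}
{code}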
The expression that contains the decompounds can also be compressed. For
compound nouns I serialize the segmentations with the term, but we could just
use offsets into the surface form. That doesn't work for Inflects, which can add
tokens or use a different form. To be honest I don't know how we could derive
the offsets for the decompound of Inflects; I don't think there is an easy way
to do that, but I could be completely wrong.
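To illustrate what I mean for the compound case (an illustrative encoding only,
not the format used in the patch):
{code:java}
import java.io.IOException;
import org.apache.lucene.store.DataOutput;

// Sketch of an offset-based encoding for COMPOUND entries (hypothetical format):
// write only the morpheme lengths and recover the morpheme text from the surface
// form at read time (offsets are the running sum of the lengths). This cannot
// work for INFLECT entries whose morphemes may differ from the surface form.
final class CompoundSegmentationWriter {
  static void write(DataOutput out, int[] morphemeLengths) throws IOException {
    out.writeVInt(morphemeLengths.length);
    for (int len : morphemeLengths) {
      out.writeVInt(len);
    }
  }
}
{code}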
In the patch I attached, the user dictionary is broken. I copied the one from
Kuromoji, but we should probably change it to accept simple nouns (NNG or NNP)
with no segmentation, and use the PREANALYSIS type to add custom segmentations
(or COMPOUND for nouns only).
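Purely as an illustration, such a user dictionary could look like this (not the
format in the patch, just a sketch of the idea):
{noformat}
# simple nouns (NNG/NNP): one surface form per line, no segmentation required
나무위키
# compound/pre-analyzed entries would carry an explicit segmentation
세종시 세종 시
{noformat}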
I talked to my Korean colleagues and they told me that godori has a negative
meaning in Korea. It is linked to illegal gambling and it also has an ancient
meaning of "killed by king's order", which is bad :(. This is what happens when
you pick a name without knowing the culture, so I apologize for this. I changed
the name to "nori", which is the proposal they made; it means joy/play. It is a
very generic name; in Japanese it means seaweed and is used to wrap sushi and
onigiri, which I find nice because it's also a reference to the Japanese
analyzer that was used as a root for this.
> Godori, a Korean analyzer based on mecab-ko-dic
> -----------------------------------------------
>
> Key: LUCENE-8231
> URL: https://issues.apache.org/jira/browse/LUCENE-8231
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Jim Ferenczi
> Priority: Major
> Attachments: LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to the IPADIC but for Korean, called mecab-ko-dic.
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab; it defines a format for the features
> adapted to the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological
> analysis (left cost + right cost + word cost) I tried to adapt the module to
> handle Korean with the mecab-ko-dic. I've started with a POC that copies the
> Kuromoji module and adapts it for the mecab-ko-dic.
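> As a quick illustration of the shared cost model (illustrative code only;
> Token, connectionCost and the other names are placeholders, not the actual
> API), the cost of a segmentation path is the sum of the word costs plus the
> connection costs between adjacent morphemes:
> {code:java}
> // Illustrative only: how a candidate path is scored with the shared model.
> // wordCost/leftId/rightId come from the dictionary, connectionCost from the
> // connection matrix; Token and connectionCost are placeholder names.
> int pathCost(List<Token> candidatePath) {
>   int cost = 0;
>   int prevRightId = 0; // BOS
>   for (Token token : candidatePath) {
>     cost += connectionCost(prevRightId, token.leftId()) + token.wordCost();
>     prevRightId = token.rightId();
>   }
>   return cost; // the tokenizer keeps the path with the lowest total cost
> }
> {code}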
> I used the same classes to build and read the dictionary, but I had to make
> some modifications to handle the differences with the IPADIC and Japanese.
> The resulting binary dictionary takes 28MB on disk; it's bigger than the
> IPADIC, but mainly because the source is bigger and there are a lot of
> compound and inflect terms that define a group of terms and the segmentation
> that can be applied.
> I attached the patch that contains this new Korean module called godori (a
> popular card game in Korea). It is an adaptation of the Kuromoji module, so
> currently the two modules don't share any code. I wanted to validate the
> approach first and check the relevancy of the results. I don't speak Korean,
> so I used the relevancy tests that were added for another Korean tokenizer
> (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output
> against mecab-ko, the official fork of MeCab for use with the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer: my version removes the nBest output
> and the decomposition of overly long tokens. I also modified the handling of
> whitespace since it is important in Korean. Whitespace that appears before a
> term is attached to that term, and this information is used to compute a
> penalty based on the part of speech of the token. The penalty cost is a
> feature added to mecab-ko to handle morphemes that should not appear after a
> whitespace and is described on the mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespace is also more in line with the official MeCab library,
> which attaches the whitespace to the term that follows.
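> For example, the penalty could be applied like this (a sketch with made-up
> numbers and a made-up POS enum; the actual tags and costs come from the
> mecab-ko description):
> {code:java}
> // Illustrative sketch: add a cost penalty when a morpheme of a given part of
> // speech is preceded by whitespace (POS and the value 3000 are placeholders).
> int spacePenalty(POS pos, boolean hasLeadingSpace) {
>   if (!hasLeadingSpace) {
>     return 0;
>   }
>   switch (pos) {
>     case VERBAL_ENDING:
>     case POSTPOSITION:
>       return 3000;   // strongly discourage a space right before these morphemes
>     default:
>       return 0;
>   }
> }
> {code}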
> I also added a decompounder filter that expands the compounds and inflects
> defined in the dictionary, and a part-of-speech filter similar to the Japanese
> one that removes the morphemes that are not useful for relevance (suffix,
> prefix, interjection, ...). These filters don't play well with the tokenizer
> if it can output multiple paths (nBest output for instance), so for simplicity
> I removed this ability and the Korean tokenizer only outputs the best path.
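> Wired together, the chain would look roughly like this (a sketch; the Korean
> class names and the stopTags set are assumptions, not necessarily what the
> patch uses):
> {code:java}
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.Tokenizer;
>
> // Sketch of a possible analysis chain: tokenizer -> decompound -> POS filtering.
> Analyzer analyzer = new Analyzer() {
>   @Override
>   protected TokenStreamComponents createComponents(String fieldName) {
>     Tokenizer tokenizer = new KoreanTokenizer();                  // best path only, no nBest
>     TokenStream stream = new KoreanDecompoundFilter(tokenizer);   // expands compounds/inflects
>     stream = new KoreanPartOfSpeechStopFilter(stream, stopTags);  // drops suffix, prefix, interjection, ...
>     return new TokenStreamComponents(tokenizer, stream);
>   }
> };
> {code}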
> I compared the results with mecab-ko to confirm that the analyzer is working,
> and ran the relevancy test defined in HantecRel.java, included in the patch
> (written by Robert for another Korean analyzer). Here are the
> results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
> I find the results very promising, so I plan to continue to work on this
> project. I started to extract the parts of the code that could be shared with
> the Kuromoji module, but I wanted to share the status and this POC first to
> confirm that this approach is viable. The advantages of using the same model
> as the Japanese analyzer are multiple: we don't have a Korean analyzer at the
> moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB), and the Tokenizer prunes the
> lattice on the fly to select the best path efficiently.
> The dictionary can be built directly from the godori module with the
> following command: ant regenerate (you first need to create the resource
> directory with mkdir
> lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict,
> since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the analysis.