[jira] [Commented] (LUCENE-8631) How Nori Tokenizer can deal with Longest-Matching

ASF subversion and git services (JIRA) Tue, 12 Mar 2019 02:45:15 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790391#comment-16790391
 ]


ASF subversion and git services commented on LUCENE-8631:
---------------------------------------------------------

Commit 8d0652451ea4ed9d0285fb5f8c7568c058c6730b in lucene-solr's branch 
refs/heads/branch_8x from Yeongsu Kim
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=8d06524 ]

LUCENE-8631: The Korean user dictionary now picks the longest-matching word and 
discards the other matches.
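
As a rough illustration of the behavior described in this commit message, the Python sketch below keeps only the longest user-dictionary entry that starts at each position and discards the shorter ones. It is not the Lucene implementation: the function names `longest_user_match` and `segment` are hypothetical, the dictionary is a plain set, and characters with no user-dictionary match are simply emitted one at a time, whereas the real tokenizer would fall back to its system dictionary.

```python
def longest_user_match(text, start, user_words):
    """Return the longest user-dictionary entry beginning at text[start], or None."""
    best = None
    for word in user_words:
        # Empty entries are skipped; keep the candidate only if it is longer
        # than the best match found so far.
        if word and text.startswith(word, start):
            if best is None or len(word) > len(best):
                best = word
    return best


def segment(text, user_words):
    """Greedy left-to-right segmentation that prefers the longest user entry."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = longest_user_match(text, pos, user_words)
        if match is None:
            # Assumption for this sketch: unmatched characters become
            # single-character tokens.
            tokens.append(text[pos])
            pos += 1
        else:
            tokens.append(match)
            pos += len(match)
    return tokens


# The three entries from the example in the issue description.
USER_WORDS = {"골드", "브라운", "골드브라운"}
```

With these three entries, `segment("골드브라운", USER_WORDS)` yields the single token ‘골드브라운’; with only ‘골드’ and ‘브라운’ in the set, the same input splits into two tokens.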


> How Nori Tokenizer can deal with Longest-Matching
> -------------------------------------------------
>
>                 Key: LUCENE-8631
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8631
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Yeongsu Kim
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> I think the Nori tokenizer has an issue. 
> I don’t understand why “Longest-Matching” is NOT working in the Nori tokenizer 
> in config mode (config mode: 
> https://www.elastic.co/guide/en/elasticsearch/plugins/6.x/analysis-nori-tokenizer.html).
>  
> Here is an example that explains what longest-matching is.
> Let us assume `userdict_ko.txt` contains only three Korean words, 
> ‘골드’, ‘브라운’, and ‘골드브라운’, and that we load it into the Nori analyzer. After 
> the update, the analyzer outputs two tokens, ‘골드’ and ‘브라운’, when the input is 
> ‘골드브라운’. (In English: ‘골드’ means ‘gold’, ‘브라운’ means ‘brown’, and ‘골드브라운’ 
> means ‘goldbrown’.)
>  
> This result shows that “Longest-Matching” is NOT working. If 
> “Longest-Matching” were working, the output would be ‘골드브라운’, which is the 
> longest matching word in the user dictionary.
>  
> Curiously enough, when we add the user dictionary via custom mode (custom mode: 
> https://github.com/jimczi/nori/blob/master/how-to-custom-dict.asciidoc),
> the result is ‘골드브라운’, so ‘Longest-Matching’ is applied. We 
> think this is because the trained MeCab engine automatically generates word 
> costs by its own criteria. We hope this mechanism can also be applied to config 
> mode.
>  
> Could you tell me how to get “Longest-Matching” in config mode (not custom), 
> or give me some hints (e.g. where to modify the source code) to solve this 
> problem?
>  
> P.S.
> I recently emailed [~jim.ferenczi], who is a developer of Nori, and 
> received these suggestions:
>    - Add a way to set a score for each new rule (this way you could set up a 
> negative cost for the compound word that is less than the sum of the two 
> single words).
>    - Same as above, but the cost is computed from the statistics of the 
> training (as the custom dictionary does when you recompile entirely).
>    - Implement longest-match first in the dictionary.
>  
> Thanks for your support.
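
The first suggestion quoted above (a compound cost lower than the sum of its parts) can be sketched as a minimum-cost path search. The Python below is a simplified illustration, not Nori's algorithm: it ignores connection costs between tokens, the function name `min_cost_segmentation` is hypothetical, and the cost values are made up for the example.

```python
def min_cost_segmentation(text, word_costs):
    """Return the token list with the lowest total word cost, or None if
    the text cannot be covered by the given words."""
    n = len(text)
    inf = float("inf")
    best = [inf] * (n + 1)   # best[i]: lowest cost of segmenting text[:i]
    back = [None] * (n + 1)  # back[i]: last word on the best path ending at i
    best[0] = 0
    for end in range(1, n + 1):
        for word, cost in word_costs.items():
            if not word:
                continue  # skip empty entries
            start = end - len(word)
            if start < 0 or best[start] == inf:
                continue
            if text.startswith(word, start) and best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = word
    if best[n] == inf:
        return None
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens = []
    pos = n
    while pos > 0:
        tokens.append(back[pos])
        pos -= len(back[pos])
    tokens.reverse()
    return tokens


# Illustrative costs only: with one uniform negative cost per entry, two
# short words (-200 in total) beat the single compound (-100).
UNIFORM = {"골드": -100, "브라운": -100, "골드브라운": -100}

# Giving the compound a cost below the sum of its parts makes it win.
TUNED = {"골드": -100, "브라운": -100, "골드브라운": -250}
```

Under these assumptions, `min_cost_segmentation("골드브라운", UNIFORM)` returns the two short tokens, while the same call with `TUNED` returns the single compound token.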



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
