[
https://issues.apache.org/jira/browse/LUCENE-8631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790369#comment-16790369
]
ASF subversion and git services commented on LUCENE-8631:
---------------------------------------------------------
Commit b1f870a4164769df62b24af63048aa2f9b21af47 in lucene-solr's branch
refs/heads/master from Yeongsu Kim
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=b1f870a ]
LUCENE-8631: The Korean user dictionary now picks the longest-matching word and
discards the other matches.
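
For readers unfamiliar with the term, the behaviour described in the commit message can be illustrated with a minimal sketch. This is not the Lucene code from the commit; the function names `longest_user_dict_match` and `segment` are hypothetical, and the single-character fallback stands in for the system dictionary lookup a real tokenizer would perform.

```python
def longest_user_dict_match(text, start, user_words):
    """Return the longest entry of user_words that begins at text[start], or None."""
    best = None
    for word in user_words:
        if text.startswith(word, start) and (best is None or len(word) > len(best)):
            best = word
    return best


def segment(text, user_words):
    """Greedy left-to-right segmentation that keeps only the longest user-dictionary match."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = longest_user_dict_match(text, pos, user_words)
        if match is None:
            # Hypothetical fallback: emit one character. A real tokenizer
            # would consult its system dictionary here instead.
            match = text[pos]
        tokens.append(match)
        pos += len(match)
    return tokens


user_words = {"골드", "브라운", "골드브라운"}
print(segment("골드브라운", user_words))
```

With all three entries present, the sketch keeps the compound as a single token; if the compound entry is removed, it falls back to the two shorter words.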
> How Nori Tokenizer can deal with Longest-Matching
> -------------------------------------------------
>
> Key: LUCENE-8631
> URL: https://issues.apache.org/jira/browse/LUCENE-8631
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Yeongsu Kim
> Priority: Major
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> I think the Nori tokenizer has an issue.
> I don’t understand why “Longest-Matching” is NOT working in the Nori tokenizer
> when the user dictionary is set up via config mode (config mode:
> https://www.elastic.co/guide/en/elasticsearch/plugins/6.x/analysis-nori-tokenizer.html).
>
> Here is an example that explains what longest matching is.
> Let us assume we have a `userdict_ko.txt` containing only three Korean words,
> ‘골드’, ‘브라운’, and ‘골드브라운’, and we load it into the Nori analyzer. After the
> update, the analyzer outputs two tokens, ‘골드’ and ‘브라운’, when the input is
> ‘골드브라운’. (In English: ‘골드’ means ‘gold’, ‘브라운’ means ‘brown’, and ‘골드브라운’
> means ‘goldbrown’.)
>
> This result shows that “Longest-Matching” is NOT working. If it were working,
> the output would be ‘골드브라운’, which is the longest matching word in the user
> dictionary.
>
> Curiously enough, when we add the user dictionary via custom mode (custom mode:
> https://github.com/jimczi/nori/blob/master/how-to-custom-dict.asciidoc),
> the result is ‘골드브라운’, so ‘Longest-Matching’ is applied. We think the reason
> is that the trained MeCab engine automatically generates word costs by its own
> criteria. We hope this mechanism can also be applied to config mode.
>
> Could you tell me how to get “Longest-Matching” via config mode (not custom
> mode), or give me some hints (e.g. where to modify the source code) to solve
> this problem?
>
> P.S.
> Recently I e-mailed [~jim.ferenczi], who is a developer of Nori, and
> received his suggestions:
> - Add a way to set a score to each new rule (this way you could set up a
> negative cost for the compound word that is less than the sum of the two
> single words).
> - Same as above but the cost is computed from the statistics of the
> training (like the custom dictionary does when you recompile entirely).
> - Implement longest-match first in the dictionary.
>
> Thanks for your support.
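
The cost-based suggestions quoted in the P.S. above can be illustrated with a small sketch. It assumes a simplified model in which a segmentation's total cost is just the sum of its word costs (connection costs between tokens are ignored), and all cost values are made up for illustration; `cheapest_segmentation` is a hypothetical name, not a Lucene or MeCab API.

```python
def cheapest_segmentation(text, word_costs):
    """Return (total_cost, tokens) for the lowest-cost split of text into known words.

    Simplified model: only per-word costs are summed; connection costs
    between adjacent tokens are ignored.
    """
    n = len(text)
    inf = float("inf")
    # best[i] holds the cheapest (cost, tokens) found for the prefix text[:i].
    best = [(inf, None)] * (n + 1)
    best[0] = (0, [])
    for end in range(1, n + 1):
        for start in range(end):
            word = text[start:end]
            if word in word_costs and best[start][0] < inf:
                cost = best[start][0] + word_costs[word]
                if cost < best[end][0]:
                    best[end] = (cost, best[start][1] + [word])
    return best[n]


# Made-up costs: the compound costs more than the two parts combined,
# so the split path wins.
print(cheapest_segmentation("골드브라운", {"골드": 100, "브라운": 100, "골드브라운": 300}))

# Made-up costs: the compound costs less than the sum of the parts,
# so the single-token path wins.
print(cheapest_segmentation("골드브라운", {"골드": 100, "브라운": 100, "골드브라운": 150}))
```

Under this model the compound is emitted as one token only when its cost is below the sum of the costs of its parts, which is the condition the first suggestion describes.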
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)