Namgyu Kim commented on LUCENE-8966:

Oh, thank you for your reply, [~jim.ferenczi] :D

I checked again and it was not a bug.
That result comes from the Viterbi path.

But I think it needs to be discussed.
So I opened a new issue for it.

I'd appreciate it if you could check LUCENE-8977.

P.S. +1 to your patch

> KoreanTokenizer should split unknown words on digits
> ----------------------------------------------------
>                 Key: LUCENE-8966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8966
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>         Attachments: LUCENE-8966.patch, LUCENE-8966.patch
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split any unknown word when a digit is followed by a 
> character of another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]
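
The rule proposed in the description can be sketched roughly as follows. This is a simplified model and not the code in KoreanTokenizer or in the attached patch: the class name UnknownWordSplitter and its methods are hypothetical, and it assumes "another type" means any non-digit character. It groups characters by Unicode script, treats COMMON and INHERITED characters as compatible with any script, and additionally breaks whenever a digit is followed by a non-digit.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Lucene's implementation.
public class UnknownWordSplitter {

    static boolean isCommonOrInherited(UnicodeScript script) {
        return script == UnicodeScript.COMMON || script == UnicodeScript.INHERITED;
    }

    // Splits text into groups of compatible characters.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        int prev = -1;                    // previous code point, -1 at the start
        UnicodeScript groupScript = null; // first specific script seen in the current group
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            if (prev != -1) {
                // Assumed digit rule: a digit followed by a non-digit ends the group.
                boolean digitThenOther = Character.isDigit(prev) && !Character.isDigit(cp);
                // Script rule: a different specific script ends the group.
                boolean scriptChange = groupScript != null
                        && !isCommonOrInherited(script)
                        && script != groupScript;
                if (digitThenOther || scriptChange) {
                    tokens.add(text.substring(start, i));
                    start = i;
                    groupScript = null;
                }
            }
            if (groupScript == null && !isCommonOrInherited(script)) {
                groupScript = script;
            }
            prev = cp;
            i += Character.charCount(cp);
        }
        if (start < text.length()) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("44사이즈"));
    }
}
```

Without the digit rule, the digits (script COMMON) would stay in the same group as the Hangul characters that follow them, which is the behavior the issue describes for "44사이즈".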

