Namgyu Kim created LUCENE-8977:
----------------------------------
Summary: Handle punctuation characters in KoreanTokenizer
Key: LUCENE-8977
URL: https://issues.apache.org/jira/browse/LUCENE-8977
Project: Lucene - Core
Issue Type: Bug
Reporter: Namgyu Kim
As we discussed in LUCENE-8966, when the input contains consecutive punctuation
marks, KoreanTokenizer currently always splits them into the first mark and the rest.
(사이즈.... => [사이즈] [.] [...])
However, KoreanTokenizer does not split them when the input starts with punctuation.
(...사이즈 => [...] [사이즈])
This looks like a result of the Viterbi path, but users may find the following
cases odd:
("사이즈" means "size" in Korean)
||Case #1||Case #2||
|Input : "...사이즈..."|Input : "...4......4사이즈"|
|Result : [...] [사이즈] [.] [..]|Result : [...] [4] [.] [.....] [4] [사이즈]|
From what I checked, Nori has punctuation characters (such as . and ,) in its
dictionary, but Kuromoji does not.
("サイズ" means "size" in Japanese)
||Case #1||Case #2||
|Input : "...サイズ..."|Input : "...4......4サイズ"|
|Result : [...] [サイズ] [...]|Result : [...] [4] [......] [4] [サイズ]|
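For reference, the grouping a user would probably expect (each run of same-class characters kept as one token, which is what the Kuromoji results above look like) can be sketched as below. This is only an illustration of the expected output, not how Nori or Kuromoji actually tokenize; `char_class` and `group_runs` are hypothetical names, and the three character classes are an assumption made for this example.

```python
import unicodedata
from itertools import groupby


def char_class(ch):
    """Coarse character class (assumed for illustration): punct, digit, or other."""
    if unicodedata.category(ch).startswith("P"):
        return "punct"
    if ch.isdigit():
        return "digit"
    # Everything else (Hangul, Katakana, Latin letters, whitespace, ...) is lumped together.
    return "other"


def group_runs(text):
    """Split text into maximal runs of characters that share the same class."""
    return ["".join(run) for _, run in groupby(text, key=char_class)]


print(group_runs("...사이즈..."))
print(group_runs("...4......4사이즈"))
```

With this grouping, leading and trailing punctuation are handled the same way, so "...사이즈..." would not produce a different split before and after the word.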
There are ways to resolve this, such as hard-coding the handling of punctuation,
but that does not seem like a good approach.
So I think we need to discuss it.
--
This message was sent by Atlassian Jira
(v8.3.2#803003)