[ https://issues.apache.org/jira/browse/LUCENE-8977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Namgyu Kim updated LUCENE-8977:
-------------------------------
    Description:
As discussed in LUCENE-8966, when the input contains consecutive punctuation marks, KoreanTokenizer currently always splits them into the first mark and the rest.
(사이즈.... => [사이즈] [.] [...])
However, KoreanTokenizer does not split them when the first character is punctuation.
(...사이즈 => [...] [사이즈])
This looks like a result of the Viterbi path, but users may find the following cases odd:
("사이즈" means "size" in Korean)
||Case #1||Case #2||
|Input : "...사이즈..."|Input : "...4......4사이즈"|
|Result : [...] [사이즈] [.] [..]|Result : [...] [4] [.] [.....] [4] [사이즈]|

From what I checked, Nori has punctuation characters (like . ,) in its dictionary but Kuromoji does not.
("サイズ" means "size" in Japanese)
||Case #1||Case #2||
|Input : "...サイズ..."|Input : "...4......4サイズ"|
|Result : [...] [サイズ] [...]|Result : [...] [4] [......] [4] [サイズ]|

There are ways to resolve this, such as hard-coding the handling of punctuation, but that does not seem like a good approach.
So I think we need to discuss it.
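To make the "hard-coding for punctuation" option concrete, here is a minimal sketch of a post-processing pass that merges adjacent punctuation-only tokens, so that the Korean results would match the Kuromoji ones shown in the tables. It is not Lucene code and not a proposed patch: `PunctuationRunMerger` and its methods are hypothetical names, it works on plain strings, it assumes the tokens it receives were adjacent in the original text (no whitespace between them), and it ignores the offsets and position increments that a real TokenFilter would have to maintain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Lucene): merges adjacent punctuation-only tokens. */
final class PunctuationRunMerger {

  /** Returns true if the code point falls in one of the Unicode punctuation categories. */
  static boolean isPunctuationCodePoint(int cp) {
    switch (Character.getType(cp)) {
      case Character.CONNECTOR_PUNCTUATION:
      case Character.DASH_PUNCTUATION:
      case Character.START_PUNCTUATION:
      case Character.END_PUNCTUATION:
      case Character.INITIAL_QUOTE_PUNCTUATION:
      case Character.FINAL_QUOTE_PUNCTUATION:
      case Character.OTHER_PUNCTUATION:
        return true;
      default:
        return false;
    }
  }

  /** Returns true if the token is non-empty and consists only of punctuation. */
  static boolean isPunctuation(String token) {
    return !token.isEmpty()
        && token.codePoints().allMatch(PunctuationRunMerger::isPunctuationCodePoint);
  }

  /**
   * Concatenates each run of consecutive punctuation tokens into one token.
   * Assumption: the tokens were adjacent in the source text.
   */
  static List<String> mergePunctuationRuns(List<String> tokens) {
    List<String> merged = new ArrayList<>();
    StringBuilder run = new StringBuilder();
    for (String token : tokens) {
      if (isPunctuation(token)) {
        run.append(token); // extend the current punctuation run
      } else {
        if (run.length() > 0) { // flush the pending run before a non-punctuation token
          merged.add(run.toString());
          run.setLength(0);
        }
        merged.add(token);
      }
    }
    if (run.length() > 0) { // flush a trailing run
      merged.add(run.toString());
    }
    return merged;
  }

  public static void main(String[] args) {
    // Token lists copied from the "Result" rows of the Korean table above.
    System.out.println(mergePunctuationRuns(Arrays.asList("...", "사이즈", ".", "..")));
    System.out.println(
        mergePunctuationRuns(Arrays.asList("...", "4", ".", ".....", "4", "사이즈")));
  }
}
```

Applied to the two Korean results, this merges [.] [..] into [...] and [.] [.....] into [......], which is the segmentation Kuromoji produces for the equivalent Japanese input. It only hides the inconsistency after the fact; it does not change how the Viterbi search scores punctuation entries in the dictionary.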
> Handle punctuation characters in KoreanTokenizer
> ------------------------------------------------
>
>                 Key: LUCENE-8977
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8977
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Namgyu Kim
>            Priority: Minor