[jira] [Updated] (LUCENE-8977) Handle punctuation characters in KoreanTokenizer

Namgyu Kim (Jira) Wed, 11 Sep 2019 12:16:57 -0700


     [ 
https://issues.apache.org/jira/browse/LUCENE-8977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Namgyu Kim updated LUCENE-8977:
-------------------------------
    Description: 
As we discussed in LUCENE-8966, when the input contains consecutive punctuation 
marks, KoreanTokenizer currently always splits them into the first mark and the 
rest.
 (사이즈.... => [사이즈] [.] [...])
 But KoreanTokenizer doesn't split them when the first character is punctuation.
 (...사이즈 => [...] [사이즈])

It looks like the result of the Viterbi path, but users may find the 
following cases odd:
 ("사이즈" means "size" in Korean)
||Case #1||Case #2||
|Input : "...사이즈..."|Input : "...4......4사이즈"|
|Result : [...] [사이즈] [.] [..]|Result : [...] [4] [.] [.....] [4] [사이즈]|

From what I checked, Nori has punctuation characters (like . ,) in its 
dictionary but Kuromoji does not.
 ("サイズ" means "size" in Japanese)
||Case #1||Case #2||
|Input : "...サイズ..."|Input : "...4......4サイズ"|
|Result : [...] [サイズ] [...]|Result : [...] [4] [......] [4] [サイズ]|

There are some ways to resolve this, such as hard-coding the handling of 
punctuation, but that does not seem like a good approach.
 So I think we need to discuss it.
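
For illustration only, here is a minimal sketch of what hard-coding for 
punctuation could look like: grouping every run of consecutive punctuation 
into a single token, which gives the same grouping as the Kuromoji results 
above. This is not how Nori or Kuromoji work internally. The class 
PunctuationRunSketch, the CharClass enum, and the segment method are 
hypothetical names, and the sketch uses only the JDK, not the Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

public class PunctuationRunSketch {

    // Korean word for "size" from the issue, written with Unicode escapes.
    static final String SIZE_KO = "\uC0AC\uC774\uC988";

    // Coarse character classes used only by this sketch (an assumption, not Nori's classes).
    enum CharClass { PUNCTUATION, DIGIT, OTHER, SPACE }

    static CharClass classify(int codePoint) {
        if (Character.isWhitespace(codePoint)) {
            return CharClass.SPACE;
        }
        if (Character.isDigit(codePoint)) {
            return CharClass.DIGIT;
        }
        switch (Character.getType(codePoint)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return CharClass.PUNCTUATION;
            default:
                return CharClass.OTHER;
        }
    }

    // Start a new token whenever the character class changes; whitespace is dropped.
    static List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        CharClass currentClass = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            CharClass cls = classify(cp);
            if (cls != currentClass && current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            if (cls != CharClass.SPACE) {
                current.appendCodePoint(cp);
            }
            currentClass = cls;
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Case #1 and Case #2 inputs from the tables above.
        System.out.println(segment("..." + SIZE_KO + "..."));
        System.out.println(segment("...4......4" + SIZE_KO));
    }
}
```

With this grouping, Case #1 yields three tokens and Case #2 yields five, 
regardless of whether the punctuation run comes first or last. The drawback 
is the one noted above: the rule bypasses the dictionary and the Viterbi 
path entirely, so it is a point of comparison rather than a proposed fix.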

  was:
As we discussed on LUCENE-8966, KoreanTokenizer always divides into one and the 
others now when there are continuous punctuation marks.
 (사이즈.... => [사이즈] [.] [...])
 But KoreanTokenizer doesn't divides when first character is punctuation.
 (...사이즈 => [...] [사이즈])

It looks like the result from the viterbi path, but users can think weird about 
the following case:
 ("사이즈" means "size" in Korean)
||Case #1||Case #2||
|Input : "...사이즈..."|Input : "...4......4사이즈"|
|Result : [...] [사이즈] [.] [..]|Result : [...] [4] [.] [.....] [4] [사이즈]|

From what I checked, Nori has a punctuation characters(like . ,) in the 
dictionary but Kuromoji is not.
 ("サイズ" means "size" in Japanese)
||Case #1||Case #2||
|Input : "...サイズ..."|Input : "...4......4サイズ"|
|Result : [...] [サイズ] [...]|Result : [...] [4] [......] [4] [サイズ]|

There are some ways to resolve it like hard-coding for punctuation but it seems 
not good.
 So I think we need to discuss it.


> Handle punctuation characters in KoreanTokenizer
> ------------------------------------------------
>
>                 Key: LUCENE-8977
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8977
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Namgyu Kim
>            Priority: Minor



--
This message was sent by Atlassian Jira
(v8.3.2#803003)