[
https://issues.apache.org/jira/browse/LUCENE-8526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640245#comment-16640245
]
Jim Ferenczi commented on LUCENE-8526:
--------------------------------------
Ok, thanks for explaining, [~steve_rowe]. I thought that script boundary breaking
was part of UAX#29 and that the ICUTokenizer and StandardTokenizer should
therefore behave the same regarding CJK splits. Maybe we could add a note to the
CJKBigram filter documentation about this behavior when the StandardTokenizer is used?
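For reference, a minimal sketch of the difference (not part of this issue;
assumes Lucene's analysis-common and analysis-icu modules on the classpath and
the default ICUTokenizer config, and the expected outputs in the comments follow
from the behavior described in this issue):

    import java.io.StringReader;

    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class HangulSplitDemo {

      // Prints the tokens a Tokenizer produces for the given text.
      static void printTokens(Tokenizer tokenizer, String text) throws Exception {
        tokenizer.setReader(new StringReader(text));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
          System.out.println("  " + term);
        }
        tokenizer.end();
        tokenizer.close();
      }

      public static void main(String[] args) throws Exception {
        // Pure UAX#29: Hangul syllables are ALetter, so the mixed run is
        // expected to stay one token: "한국abc".
        printTokens(new StandardTokenizer(), "한국abc");
        // ICU adds script-boundary breaks on top of UAX#29, so the default
        // config is expected to yield two tokens: "한국" and "abc".
        printTokens(new ICUTokenizer(), "한국abc");
      }
    }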
> StandardTokenizer doesn't separate hangul characters from other non-CJK chars
> -----------------------------------------------------------------------------
>
> Key: LUCENE-8526
> URL: https://issues.apache.org/jira/browse/LUCENE-8526
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Jim Ferenczi
> Priority: Minor
>
> It was first reported here:
> https://github.com/elastic/elasticsearch/issues/34285.
> I don't know if it's the expected behavior, but the StandardTokenizer does
> not split words that are composed of a mix of non-CJK characters and Hangul
> syllables. For instance, "한국2018" or "한국abc" is kept as-is by this
> tokenizer and marked as an alphanumeric group. This breaks the CJKBigram
> token filter, which will not build bigrams on such groups. The other CJK
> characters are correctly split when they are mixed with other alphabets, so
> I'd expect the same for Hangul.
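To illustrate the knock-on effect on the bigram filter, here is a minimal
sketch (not part of the original report; assumes Lucene's analysis-common
module, and the expected output is an assumption based on the description
above):

    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.cjk.CJKBigramFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

    public class CJKBigramDemo {
      public static void main(String[] args) throws Exception {
        Tokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader("한국abc"));
        // CJKBigramFilter keys off the token types set by the tokenizer
        // (<IDEOGRAPHIC>, <HANGUL>, ...); a mixed token typed <ALPHANUM>
        // is passed through untouched, so no bigrams are produced here.
        TokenStream stream = new CJKBigramFilter(tokenizer);
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        TypeAttribute type = stream.addAttribute(TypeAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
          // Expected: a single line "한국abc <ALPHANUM>", no bigrams.
          System.out.println(term + " " + type.type());
        }
        stream.end();
        stream.close();
      }
    }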