Jim Ferenczi created LUCENE-8526:
------------------------------------
Summary: StandardTokenizer doesn't separate Hangul characters from other non-CJK chars
Key: LUCENE-8526
URL: https://issues.apache.org/jira/browse/LUCENE-8526
Project: Lucene - Core
Issue Type: Improvement
Reporter: Jim Ferenczi
It was first reported here:
https://github.com/elastic/elasticsearch/issues/34285.
I don't know if it's the expected behavior, but the StandardTokenizer does not
split words that are composed of a mix of non-CJK characters and Hangul
syllables. For instance, "한국2018" or "한국abc" is kept as-is by this tokenizer
and marked as an alpha-numeric group. This breaks the CJKBigram token filter,
which will not build bigrams on such groups. The other CJK characters are
correctly split when they are mixed with other alphabets, so I'd expect the
same for Hangul.
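
A minimal reproduction sketch, assuming the Lucene 7.x tokenizer API; the
token/type pairs in the comments reflect the behavior described above, not
verified output:

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

    public class HangulSplitRepro {
      public static void main(String[] args) throws Exception {
        // Tokenize a string that mixes Hangul syllables with ASCII digits.
        StandardTokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader("한국2018"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
          // Reported behavior: a single token "한국2018" typed <ALPHANUM>,
          // rather than "한국" (<HANGUL>) followed by "2018" (<NUM>) as one
          // would expect given how other CJK scripts are split.
          System.out.println(term + " " + type.type());
        }
        tokenizer.end();
        tokenizer.close();
      }
    }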