Jim Ferenczi created LUCENE-10081:
-------------------------------------
Summary: KoreanTokenizer should check the max backtrace gap on
whitespaces
Key: LUCENE-10081
URL: https://issues.apache.org/jira/browse/LUCENE-10081
Project: Lucene - Core
Issue Type: Bug
Reporter: Jim Ferenczi
Today the KoreanTokenizer keeps track of the whitespace characters that
appear before a known term in order to apply a space penalty factor. This
whitespace is considered part of the next term, so the backtrace gap limit is
not applied to it. As a result, the position buffer can grow as large as the
longest run of consecutive whitespace in the input. This is problematic
because the buffer is reused on reset() and so keeps whatever size it reached.
We should ensure that the max backtrace gap limit is applied consistently to
consecutive whitespace.
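
The effect can be illustrated with a toy model. The sketch below is not the
KoreanTokenizer code: the class WhitespaceGapSketch, the methods peakPending
and spaces, and the maxGap parameter are all hypothetical names, and the model
assumes (for simplicity) that every non-whitespace character is a known term
that flushes the pending positions. It only shows how the number of buffered
positions behaves with and without a gap check on whitespace.

```java
import java.util.Arrays;

/** Toy model of a position buffer. Hypothetical; not Lucene's implementation. */
final class WhitespaceGapSketch {

  /**
   * Scans the input and returns the largest number of positions that were
   * pending (buffered but not yet flushed) at any one time.
   *
   * @param input text to scan
   * @param maxGap hypothetical cap on pending positions before a forced flush
   * @param applyGapOnWhitespace if false, whitespace never triggers a flush
   */
  static int peakPending(String input, int maxGap, boolean applyGapOnWhitespace) {
    int pending = 0;
    int peak = 0;
    for (int i = 0; i < input.length(); i++) {
      pending++; // every character occupies one buffered position
      peak = Math.max(peak, pending);
      if (!Character.isWhitespace(input.charAt(i))) {
        // Assumption: a non-whitespace character is a known term, which
        // ends the whitespace run and flushes everything buffered so far.
        pending = 0;
      } else if (applyGapOnWhitespace && pending >= maxGap) {
        // The proposed behaviour: whitespace is also subject to the gap limit.
        pending = 0;
      }
    }
    return peak;
  }

  /** Returns a string of n space characters. */
  static String spaces(int n) {
    char[] cs = new char[n];
    Arrays.fill(cs, ' ');
    return new String(cs);
  }
}
```

In this model, an input of 10000 spaces followed by one term gives a peak of
10001 pending positions when whitespace is exempt from the limit, and a peak
equal to maxGap when the limit is also applied to whitespace.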