Alan Woodward created LUCENE-8516:
-------------------------------------
Summary: Make WordDelimiterGraphFilter a Tokenizer
Key: LUCENE-8516
URL: https://issues.apache.org/jira/browse/LUCENE-8516
Project: Lucene - Core
Issue Type: Task
Reporter: Alan Woodward
Assignee: Alan Woodward
Being able to split tokens at arbitrary points in a filter chain, in effect
adding a second round of tokenization, can cause any number of problems when
trying to keep token streams within the TokenStream contract. The most common
offender here is the WordDelimiterGraphFilter, which can produce broken offsets
in a wide range of situations.
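
To illustrate the kind of offset breakage described above, here is a minimal
sketch in plain Java. It does not use Lucene and is not WDGF's actual code:
the Token class, splitOnHyphen and offsetsWithinParent are hypothetical names
made up for this example. It assumes a naive splitter that derives each part's
offsets from its position in the term text. That works when the term text
still matches the original input, but not once an upstream filter has
rewritten the term.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitOffsetsSketch {

    /** Hypothetical stand-in for a token: term text plus offsets into the original input. */
    static final class Token {
        final String text;
        final int startOffset;
        final int endOffset;

        Token(String text, int startOffset, int endOffset) {
            this.text = text;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return text + "[" + startOffset + "," + endOffset + ")";
        }
    }

    /** Splits a token at '-' and derives each part's offsets from its position in the term text. */
    static List<Token> splitOnHyphen(Token token) {
        List<Token> parts = new ArrayList<>();
        String text = token.text;
        int partStart = 0;
        for (int i = 0; i <= text.length(); i++) {
            if (i == text.length() || text.charAt(i) == '-') {
                if (i > partStart) {
                    parts.add(new Token(text.substring(partStart, i),
                            token.startOffset + partStart,
                            token.startOffset + i));
                }
                partStart = i + 1;
            }
        }
        return parts;
    }

    /** Returns true if every part lies inside the parent token's offsets. */
    static boolean offsetsWithinParent(Token parent, List<Token> parts) {
        for (Token part : parts) {
            if (part.startOffset < parent.startOffset || part.endOffset > parent.endOffset) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Term text matches the input, so the derived offsets are valid.
        Token fromTokenizer = new Token("wi-fi", 0, 5);
        List<Token> good = splitOnHyphen(fromTokenizer);
        System.out.println(good + " valid=" + offsetsWithinParent(fromTokenizer, good));

        // Assumed scenario: an upstream filter replaced the 4-character input "wifi"
        // with a longer term, so term length no longer matches endOffset - startOffset.
        Token rewritten = new Token("wireless-fidelity", 0, 4);
        List<Token> bad = splitOnHyphen(rewritten);
        System.out.println(bad + " valid=" + offsetsWithinParent(rewritten, bad));
    }
}
```

In the second case the parts point past the end of the original token. A
splitter that runs mid-chain cannot tell which characters of the rewritten
term correspond to which characters of the input, whereas a tokenizer always
works from the input itself.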
We should make WDGF a Tokenizer in its own right, which should preserve all the
functionality we need, but make reasoning about the resulting tokenstream much
simpler.