Alan Woodward created LUCENE-8516:
-------------------------------------

             Summary: Make WordDelimiterGraphFilter a Tokenizer
                 Key: LUCENE-8516
                 URL: https://issues.apache.org/jira/browse/LUCENE-8516
             Project: Lucene - Core
          Issue Type: Task
            Reporter: Alan Woodward
            Assignee: Alan Woodward


Being able to split tokens at arbitrary points in a filter chain, in effect 
adding a second round of tokenization, can cause any number of problems when 
trying to keep token streams within the TokenStream contract.  The most common 
offender here is the WordDelimiterGraphFilter, which can produce broken offsets 
in a wide range of situations.
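
As a rough illustration of how splitting late in a chain can break offsets, 
here is a minimal, self-contained sketch.  It does not use Lucene's API: Tok, 
splitOnHyphen and offsetsMatch are hypothetical names invented for this 
example, and the "upstream filter" is simulated by handing the splitter a term 
whose text no longer lines up with its offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in types; this is NOT Lucene's TokenStream API.
final class OffsetDemo {

    // A term plus its start/end offsets into the original text.
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Splits a term on '-' and derives each part's offsets from its
    // position inside the term, as a late-stage splitting filter must.
    static List<Tok> splitOnHyphen(Tok t) {
        List<Tok> out = new ArrayList<>();
        int pos = 0;
        for (String part : t.term.split("-")) {
            if (!part.isEmpty()) {
                out.add(new Tok(part, t.start + pos, t.start + pos + part.length()));
            }
            pos += part.length() + 1; // skip the part and one delimiter
        }
        return out;
    }

    // True if every token's offsets select exactly its term in the text.
    static boolean offsetsMatch(String text, List<Tok> toks) {
        for (Tok t : toks) {
            if (t.start < 0 || t.end > text.length()
                    || !text.substring(t.start, t.end).equals(t.term)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Term text still matches the original text: offsets stay valid.
        Tok direct = new Tok("power-shot", 0, 10);
        System.out.println(offsetsMatch("power-shot", splitOnHyphen(direct)));

        // Assumed scenario: an earlier filter rewrote "power--shot" to
        // "power-shot" but kept the original offsets (0, 11).
        Tok rewritten = new Tok("power-shot", 0, 11);
        System.out.println(offsetsMatch("power--shot", splitOnHyphen(rewritten)));
    }
}
```

In the second case the splitter can only guess at sub-token offsets, because 
the term it sees is no longer a substring of the original text.  A component 
that reads the original characters directly does not have this ambiguity.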

We should make WDGF a Tokenizer in its own right, which should preserve all the 
functionality we need, but make reasoning about the resulting tokenstream much 
simpler.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

