[ https://issues.apache.org/jira/browse/LUCENE-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631638#comment-16631638 ]
Alan Woodward commented on LUCENE-8516:
---------------------------------------
Here's a first stab at a patch, which largely just copies existing WDGF
functionality. WordDelimiterTokenizer takes a root tokenizer (so you could
base it on standard, keyword or whitespace and still get the extra level of
tokenization you need) and then applies its extra tokenization on top.
* I've removed the 'english possessive' option, as we have an existing filter
  that will do that.
* I've kept the configuration flags, but this may be an opportunity to make the
  API easier to use. For example, we could make WordDelimiterIterator an
  abstract class with an overridable isBreak(int previous, int current) method.
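
To make the second suggestion concrete, here is a rough sketch of what an overridable isBreak(int previous, int current) hook might look like. This is not the code in the attached patch: the class names SketchDelimiterIterator and CaseAndDigitBreaks, the isDelimiter hook, and the split method are all hypothetical, and the sketch works on plain strings rather than on Lucene token attributes and offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; none of these names come from the patch or from Lucene.
abstract class SketchDelimiterIterator {

    /** Returns true if a sub-token boundary falls between two adjacent code points. */
    protected abstract boolean isBreak(int previous, int current);

    /** Assumption: code points that are neither letters nor digits separate sub-tokens and are dropped. */
    protected boolean isDelimiter(int codePoint) {
        return !Character.isLetterOrDigit(codePoint);
    }

    /** Splits one token from the root tokenizer into sub-tokens. */
    public final List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int previous = -1; // -1 means "no previous code point in the current sub-token"
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            i += Character.charCount(cp);
            if (isDelimiter(cp)) {
                flush(current, parts);
                previous = -1;
                continue;
            }
            if (previous != -1 && isBreak(previous, cp)) {
                flush(current, parts);
            }
            current.appendCodePoint(cp);
            previous = cp;
        }
        flush(current, parts);
        return parts;
    }

    // Emits the pending sub-token, if any, and resets the buffer.
    private static void flush(StringBuilder buffer, List<String> parts) {
        if (buffer.length() > 0) {
            parts.add(buffer.toString());
            buffer.setLength(0);
        }
    }
}

// One possible rule set: break on lower-to-upper case changes and on letter/digit changes.
final class CaseAndDigitBreaks extends SketchDelimiterIterator {
    @Override
    protected boolean isBreak(int previous, int current) {
        boolean caseChange = Character.isLowerCase(previous) && Character.isUpperCase(current);
        boolean typeChange = Character.isLetter(previous) != Character.isLetter(current);
        return caseChange || typeChange;
    }
}
```

With these rules, new CaseAndDigitBreaks().split("PowerShot500-mk2") gives Power, Shot, 500, mk, 2. The appeal of this shape is that a user who wants different splitting behaviour overrides a single method instead of working out which combination of flags produces it.
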
> Make WordDelimiterGraphFilter a Tokenizer
> -----------------------------------------
>
> Key: LUCENE-8516
> URL: https://issues.apache.org/jira/browse/LUCENE-8516
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Alan Woodward
> Assignee: Alan Woodward
> Priority: Major
> Attachments: LUCENE-8516.patch
>
>
> Being able to split tokens up at arbitrary points in a filter chain, in
> effect adding a second round of tokenization, can cause any number of
> problems when trying to keep token streams within the TokenStream contract.
> The most common offender here is the WordDelimiterGraphFilter, which can
> produce broken offsets in a wide range of situations.
> We should make WDGF a Tokenizer in its own right, which should preserve all
> the functionality we need, but make reasoning about the resulting tokenstream
> much simpler.