[
https://issues.apache.org/jira/browse/LUCENE-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631705#comment-16631705
]
Alan Woodward commented on LUCENE-8516:
---------------------------------------
It's needed at the moment for the concatenation parameters, in that if you're
stringing terms back together again then you need to know where to stop. Then
again, that's an argument for getting rid of concatenation.
In my experience, WDGF is used for two purposes: searching for hyphenated or
apostrophised words, and searching for IDs or manufacturing part numbers.
Concentrating on the second, we could make this tokenizer something like
CharTokenizer, only instead of breaking solely on specific characters, you
could also break on transitions. For the first, a simple filter that indexes all
subparts of a word without changing offsets (more like a synonym filter) might
be the way forward?
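To make the "break on transitions" idea concrete, here is a minimal sketch (plain Java, not Lucene API; the class name and the chosen character classes are illustrative assumptions) of splitting a part number on character-class transitions as well as on delimiter characters, the way a CharTokenizer-style WDG tokenizer might:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not a Lucene Tokenizer: splits on delimiters and on
// character-class transitions (digit/letter, lower-to-upper).
public class TransitionSplitter {

    // Classify each character; a change of class marks a token boundary.
    private static int charClass(char c) {
        if (Character.isDigit(c)) return 0;
        if (Character.isUpperCase(c)) return 1;
        if (Character.isLowerCase(c)) return 2;
        return -1; // delimiter: not part of any token
    }

    public static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int prevClass = -1;
        for (char c : input.toCharArray()) {
            int cls = charClass(c);
            // Boundary on a delimiter, or on a class change -- except
            // upper-to-lower, so that "Shot" stays one token.
            boolean boundary = cls == -1
                    || (prevClass != -1 && cls != prevClass
                        && !(prevClass == 1 && cls == 2));
            if (boundary && current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            if (cls != -1) current.append(c);
            prevClass = cls;
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("PowerShot-500x")); // [Power, Shot, 500, x]
    }
}
```

Because the splitting happens at tokenization time, each subtoken's offsets can be taken directly from its position in the raw input, which is exactly what a downstream filter cannot reliably do.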
> Make WordDelimiterGraphFilter a Tokenizer
> -----------------------------------------
>
> Key: LUCENE-8516
> URL: https://issues.apache.org/jira/browse/LUCENE-8516
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Alan Woodward
> Assignee: Alan Woodward
> Priority: Major
> Attachments: LUCENE-8516.patch
>
>
> Being able to split tokens up at arbitrary points in a filter chain, in
> effect adding a second round of tokenization, can cause any number of
> problems when trying to keep tokenstreams to their contract. The most common
> offender here is the WordDelimiterGraphFilter, which can produce broken
> offsets in a wide range of situations.
> We should make WDGF a Tokenizer in its own right, which should preserve all
> the functionality we need, but make reasoning about the resulting tokenstream
> much simpler.
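A hypothetical illustration of the offset problem described above (plain Java, no Lucene types; the normalization scenario and method name are assumptions for the example): once an earlier filter has changed the token text's length relative to the raw input, a filter that splits the token can only compute subtoken offsets from positions in the filtered text, which no longer line up with the original.

```java
public class OffsetDemo {

    // Returns the original-input text at the offsets a splitting filter
    // would compute for the subtoken "Fi" from the *filtered* token text.
    static String textAtGuessedOffset(String original, String filteredToken,
                                      int tokenStart) {
        int fiStart = tokenStart + filteredToken.indexOf("Fi");
        return original.substring(fiStart, fiStart + 2);
    }

    public static void main(String[] args) {
        // Suppose an earlier char filter stripped the U+2010 hyphen, so the
        // stream carries the token "WiFi" with startOffset 0.
        String original = "Wi\u2010Fi network";
        // The guessed offset points at "\u2010F" in the original, not "Fi":
        System.out.println(textAtGuessedOffset(original, "WiFi", 0));
    }
}
```

A tokenizer that does the splitting itself never faces this mismatch, since it reads offsets straight off the raw character stream.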
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)