My current use of WordDelimiterGraphFilter requires it to remain a filter,
because other filters have to run before it. I think leaving offsets
untouched preserves more flexibility, and since the offsets are already
unreliable, we wouldn't be losing much.
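
To illustrate the idea of token parts keeping the original offsets, here is
a minimal Python sketch. It is not Lucene code: `Token` and
`split_keep_offsets` are hypothetical names, and the split rule (any run of
non-alphanumeric characters) is a simplifying assumption standing in for the
real WDGF rules. Each part inherits the offsets of the token it came from,
so the split itself introduces no new offsets.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for a token with its offsets into the input."""
    text: str
    start: int  # start offset of the original, undelimited token
    end: int    # end offset of the original, undelimited token


def split_keep_offsets(token: Token) -> List[Token]:
    """Split on delimiter runs; every part keeps the parent's offsets."""
    # Assumption: a delimiter is any run of non-alphanumeric characters.
    pieces = [p for p in re.split(r"[^A-Za-z0-9]+", token.text) if p]
    # Offsets are copied from the parent, never recomputed from the text,
    # so earlier filters that changed the text cannot skew them here.
    return [Token(p, token.start, token.end) for p in pieces]


parts = split_keep_offsets(Token("wi-fi", 10, 15))
```

The trade-off is that highlighting can only mark the whole original token,
not the individual parts.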

On Sun, Sep 30, 2018, 11:32 AM Alan Woodward (JIRA) <[email protected]> wrote:

>
> [ https://issues.apache.org/jira/browse/LUCENE-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16633406#comment-16633406 ]
>
> Alan Woodward commented on LUCENE-8516:
> ---------------------------------------
>
> Another solution would be for WordDelimiterGraphFilter to no longer amend
> offsets.  So all token parts would be stored with the offsets of the
> original undelimited token.
>
> > Make WordDelimiterGraphFilter a Tokenizer
> > -----------------------------------------
> >
> >                 Key: LUCENE-8516
> >                 URL: https://issues.apache.org/jira/browse/LUCENE-8516
> >             Project: Lucene - Core
> >          Issue Type: Task
> >            Reporter: Alan Woodward
> >            Assignee: Alan Woodward
> >            Priority: Major
> >         Attachments: LUCENE-8516.patch
> >
> >
> > Being able to split tokens up at arbitrary points in a filter chain, in
> > effect adding a second round of tokenization, can cause any number of
> > problems when trying to keep tokenstreams to contract.  The most common
> > offender here is the WordDelimiterGraphFilter, which can produce broken
> > offsets in a wide range of situations.
> > We should make WDGF a Tokenizer in its own right, which should preserve
> > all the functionality we need, but make reasoning about the resulting
> > tokenstream much simpler.
>
>
>
