My current usage of WordDelimiterGraphFilter requires it to stay a filter, since other filters have to run before it. I think leaving the offsets untouched preserves more flexibility, and since the offsets it produces are already unreliable, we wouldn't be losing much.
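
To illustrate what "not touching offsets" would mean in practice, here is a minimal sketch. It is plain Python rather than Lucene's Java API, and `Token` and `split_keep_offsets` are hypothetical names made up for this example. The regex is only a rough stand-in for the filter's real splitting rules. The point is just that every part inherits the offsets of the original undelimited token.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    """Hypothetical token: a term plus its character offsets in the source text."""
    term: str
    start: int
    end: int


def split_keep_offsets(token: Token) -> List[Token]:
    """Split a token into word parts without amending offsets.

    Assumption: parts are found at case changes, digit runs, and
    non-alphanumeric delimiters. This only approximates what
    WordDelimiterGraphFilter does.
    """
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|[0-9]+", token.term)
    # Every part reuses the original token's start and end offsets.
    return [Token(part, token.start, token.end) for part in parts]


if __name__ == "__main__":
    for part in split_keep_offsets(Token("Wi-Fi", 10, 15)):
        print(part)
```

With this approach an earlier filter can rewrite the term freely, because the parts never claim offsets finer than those of the token they came from.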

On Sun, Sep 30, 2018, 11:32 AM Alan Woodward (JIRA) <[email protected]> wrote:

>
>     [ https://issues.apache.org/jira/browse/LUCENE-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16633406#comment-16633406 ]
>
> Alan Woodward commented on LUCENE-8516:
> ---------------------------------------
>
> Another solution would be for WordDelimiterGraphFilter to no longer amend
> offsets. So all token parts would be stored with the offsets of the
> original undelimited token.
>
> > Make WordDelimiterGraphFilter a Tokenizer
> > -----------------------------------------
> >
> >                 Key: LUCENE-8516
> >                 URL: https://issues.apache.org/jira/browse/LUCENE-8516
> >             Project: Lucene - Core
> >          Issue Type: Task
> >            Reporter: Alan Woodward
> >            Assignee: Alan Woodward
> >            Priority: Major
> >         Attachments: LUCENE-8516.patch
> >
> >
> > Being able to split tokens up at arbitrary points in a filter chain, in
> > effect adding a second round of tokenization, can cause any number of
> > problems when trying to keep tokenstreams to contract. The most common
> > offender here is the WordDelimiterGraphFilter, which can produce broken
> > offsets in a wide range of situations.
> > We should make WDGF a Tokenizer in its own right, which should preserve
> > all the functionality we need, but make reasoning about the resulting
> > tokenstream much simpler.
