[ https://issues.apache.org/jira/browse/LUCENE-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631648#comment-16631648 ]
Robert Muir commented on LUCENE-8516:
-------------------------------------

{quote}
WordDelimiterTokenizer takes a root tokenizer (so you could base it on standard, keyword or whitespace and still get the extra level of tokenization you need) and then applies its extra tokenization on top.
{quote}

This seems unnecessary. It's already over-configurable as far as how to break itself.

> Make WordDelimiterGraphFilter a Tokenizer
> -----------------------------------------
>
>                 Key: LUCENE-8516
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8516
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8516.patch
>
>
> Being able to split tokens up at arbitrary points in a filter chain, in
> effect adding a second round of tokenization, can cause any number of
> problems when trying to keep tokenstreams to contract. The most common
> offender here is the WordDelimiterGraphFilter, which can produce broken
> offsets in a wide range of situations.
> We should make WDGF a Tokenizer in its own right, which should preserve all
> the functionality we need, but make reasoning about the resulting tokenstream
> much simpler.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)