[ https://issues.apache.org/jira/browse/LUCENE-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16638156#comment-16638156 ]
Mike Sokolov commented on LUCENE-8516:
--------------------------------------

{quote}Can you elaborate? This rings a bell but I forget.{quote}

LUCENE-8450 has the discussion. The basic idea there was to add methods to TokenStream analogous to CharFilter.correctOffset, so that TokenFilters could also map offsets correctly.

> Make WordDelimiterGraphFilter a Tokenizer
> -----------------------------------------
>
>                 Key: LUCENE-8516
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8516
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8516.patch
>
>
> Being able to split tokens up at arbitrary points in a filter chain, in
> effect adding a second round of tokenization, can cause any number of
> problems when trying to keep tokenstreams to their contract. The most common
> offender here is the WordDelimiterGraphFilter, which can produce broken
> offsets in a wide range of situations.
> We should make WDGF a Tokenizer in its own right; this should preserve all
> the functionality we need, but make reasoning about the resulting tokenstream
> much simpler.


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
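The offset problem described above can be sketched outside of Lucene: when a filter splits a token into sub-tokens, each sub-token's start/end offsets must be derived from the parent token's start offset, and if an upstream CharFilter modified the character stream, every derived offset must additionally be mapped back through something like CharFilter.correctOffset. The OffsetCorrector interface and all names below are hypothetical illustrations, not Lucene API:

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetSketch {
    // Hypothetical analogue of CharFilter.correctOffset: maps an offset in the
    // (possibly rewritten) filtered stream back to an offset in the original input.
    interface OffsetCorrector {
        int correctOffset(int offset);
    }

    // Split a token's text at '-' (as a word-delimiter filter might) and compute
    // corrected [start, end) offsets for each sub-token, relative to the original input.
    static List<int[]> splitOffsets(String token, int startOffset, OffsetCorrector corrector) {
        List<int[]> offsets = new ArrayList<>();
        int partStart = 0;
        for (int i = 0; i <= token.length(); i++) {
            if (i == token.length() || token.charAt(i) == '-') {
                if (i > partStart) { // skip empty parts
                    offsets.add(new int[] {
                        corrector.correctOffset(startOffset + partStart),
                        corrector.correctOffset(startOffset + i)
                    });
                }
                partStart = i + 1;
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Identity corrector: no upstream CharFilter changed the text.
        OffsetCorrector identity = o -> o;
        // Token "wi-fi" at start offset 10 splits into "wi" [10,12) and "fi" [13,15).
        for (int[] part : splitOffsets("wi-fi", 10, identity)) {
            System.out.println(part[0] + "," + part[1]);
        }
    }
}
```

Without such a corrector hook on TokenStream, a filter that splits tokens can only guess at offsets into the original input, which is exactly why WDGF produces broken offsets when earlier stages have altered the text.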