[ https://issues.apache.org/jira/browse/LUCENE-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16574825#comment-16574825 ]

Robert Muir commented on LUCENE-8450:
-------------------------------------

I feel pretty strongly that we shouldn't go this route. Dividing text up into 
tokens is the Tokenizer's job; that is the Lucene class for it.

Also, when I say maintenance side, just look at the offending filters that 
really should be tokenizers: they are all a real nightmare: CJKBigramFilter, 
WordDelimiterFilter, etc. These things are monstrously complex and difficult to 
work with: it's clearly not what a TokenFilter should be doing.

I don't want to change this situation into "differently broken", where we 
somehow have to add logic to *hundreds* of TokenFilters, when we could just fix 
the *2 or 3* bad ones, such as WordDelimiterFilter, to be tokenizers instead.

> Enable TokenFilters to assign offsets when splitting tokens
> -----------------------------------------------------------
>
>                 Key: LUCENE-8450
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8450
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Major
>         Attachments: offsets.patch
>
>
> CharFilters and TokenFilters may alter token lengths, meaning that subsequent 
> filters cannot perform simple arithmetic to calculate the original 
> ("correct") offset of a character in the interior of the token. A similar 
> situation exists for Tokenizers, but these can call 
> CharFilter.correctOffset() to map offsets back to their original location in 
> the input stream. There is no such API for TokenFilters.
> This issue calls for adding an API to support use cases like highlighting the 
> correct portion of a compound token. For example, the German word 
> "außerstand" (meaning, afaict, "unable to do something") will be decompounded 
> and match "stand" and "ausser", but as things are today, offsets are always 
> set using the start and end of the tokens produced by the Tokenizer, meaning 
> that highlighters will match the entire compound.
> I'm proposing to add this method to {{TokenStream}}:
> {{public CharOffsetMap getCharOffsetMap();}}
> referencing a CharOffsetMap with these methods:
> {{int correctOffset(int currentOff);}}
> {{int uncorrectOffset(int originalOff);}}
>  
> The uncorrectOffset method is a pseudo-inverse of correctOffset, mapping from 
> original offset forward to the current "offset space".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
