Mike Sokolov created LUCENE-8450:
------------------------------------

             Summary: Enable TokenFilters to assign offsets when splitting tokens
                 Key: LUCENE-8450
                 URL: https://issues.apache.org/jira/browse/LUCENE-8450
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Mike Sokolov
         Attachments: offsets.patch

CharFilters and TokenFilters may alter token lengths, meaning that subsequent 
filters cannot perform simple arithmetic to calculate the original ("correct") 
offset of a character in the interior of the token. A similar situation exists 
for Tokenizers, but these can call CharFilter.correctOffset() to map offsets 
back to their original location in the input stream. There is no such API for 
TokenFilters.

This issue calls for adding an API to support use cases like highlighting the 
correct portion of a compound token. For example, the German word "außerstand" 
(meaning afaict "unable to do something") will be decompounded and match "stand" 
and "ausser", but as things are today, offsets are always set using the start 
and end of the tokens produced by the Tokenizer, meaning that highlighters will 
match the entire compound.
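To make the arithmetic problem concrete, here is a minimal sketch in plain Java (no Lucene classes). It assumes a hypothetical folding step that rewrites "ß" as "ss", standing in for any length-altering filter; `fold` and `naiveStart` are illustrative names, not part of Lucene or of the attached patch.

```java
final class OffsetDriftDemo {
    // Hypothetical stand-in for a length-altering filter: fold "ß" to "ss".
    static String fold(String s) {
        return s.replace("\u00df", "ss");
    }

    // Start of `part` in the original text, computed naively as
    // tokenStart + position of `part` inside the folded token.
    static int naiveStart(int tokenStart, String foldedToken, String part) {
        return tokenStart + foldedToken.indexOf(part);
    }

    public static void main(String[] args) {
        String original = "au\u00dferstand";   // token as it appears in the input
        String folded = fold(original);        // token as a later filter sees it

        int naive = naiveStart(0, folded, "stand");
        int actual = original.indexOf("stand");

        System.out.println("folded token : " + folded);
        System.out.println("naive start  : " + naive);
        System.out.println("actual start : " + actual);
    }
}
```

Here the naive start is 6 while the actual start in the original text is 5, so a highlighter using the naive value would be off by one character. The drift grows with every length change that precedes the sub-token.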

I'm proposing to add this method to `TokenStream`:


{{     public CharOffsetMap getCharOffsetMap()}}

referencing a CharOffsetMap with these methods:


{{     int correctOffset(int currentOff);}}
{{     int uncorrectOffset(int originalOff);}}

 

The uncorrectOffset method is a pseudo-inverse of correctOffset, mapping from 
original offset forward to the current "offset space".
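As a rough illustration of how the two methods could relate, here is a self-contained sketch in plain Java. Only the interface name and the two method signatures come from the proposal; `ArrayOffsetMap`, its constructor, and the cumulative-diff representation are assumptions made for this example, and the attached patch may work differently. The pseudo-inverse is taken here to mean "the smallest current offset whose corrected value is at least the given original offset".

```java
import java.util.Arrays;

final class CharOffsetMapSketch {

    /** The two methods named in the proposal. */
    interface CharOffsetMap {
        int correctOffset(int currentOff);
        int uncorrectOffset(int originalOff);
    }

    /** Hypothetical implementation backed by cumulative offset differences. */
    static final class ArrayOffsetMap implements CharOffsetMap {
        private final int[] starts; // ascending current offsets where a new diff begins
        private final int[] diffs;  // original = current + diffs[i] for current >= starts[i]

        ArrayOffsetMap(int[] starts, int[] diffs) {
            this.starts = starts.clone();
            this.diffs = diffs.clone();
        }

        @Override
        public int correctOffset(int currentOff) {
            int i = Arrays.binarySearch(starts, currentOff);
            if (i < 0) {
                i = -i - 2; // index of the last start <= currentOff, or -1 if none
            }
            return i < 0 ? currentOff : currentOff + diffs[i];
        }

        @Override
        public int uncorrectOffset(int originalOff) {
            int segStart = 0;
            int diff = 0;
            // Walk the segments in order; the first one that can hold the
            // candidate yields the smallest matching current offset.
            for (int i = 0; i <= starts.length; i++) {
                int segEnd = i < starts.length ? starts[i] : Integer.MAX_VALUE;
                int candidate = originalOff - diff;
                if (candidate < segEnd) {
                    return Math.max(candidate, segStart);
                }
                if (i < starts.length) {
                    segStart = starts[i];
                    diff = diffs[i];
                }
            }
            throw new IllegalStateException("offset out of range: " + originalOff);
        }
    }

    public static void main(String[] args) {
        // "außerstand" folded to "ausserstand": from current offset 3 onward,
        // the original offset is one less than the current offset.
        CharOffsetMap map = new ArrayOffsetMap(new int[] {3}, new int[] {-1});

        // Sub-token "stand" occupies [6, 11) in the folded token.
        System.out.println("original start: " + map.correctOffset(6));
        System.out.println("original end  : " + map.correctOffset(11));
        System.out.println("current start : " + map.uncorrectOffset(5));
    }
}
```

With this map, the sub-token "stand" at current offsets [6, 11) corrects to [5, 10) in the original text, and `uncorrectOffset(5)` leads back to 6. The mapping is only a pseudo-inverse because both current offsets 2 and 3 correct to original offset 2, so a round trip cannot always recover the starting value.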



