[ https://issues.apache.org/jira/browse/LUCENE-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16574713#comment-16574713 ]

Mike Sokolov commented on LUCENE-8450:
--------------------------------------

{quote}I am sorry that some tokenfilters that really should be tokenizers 
extend the wrong base class, but that problem should simply be fixed.
{quote}
A tokenfilter such as decompounding can't really be a tokenizer since it needs 
normalization that is provided by earlier components (at the very least lower 
casing, but also splitting on script changes and other character 
normalization). I guess one could just smoosh all analysis logic into the 
tokenizer, but that really defeats the purpose of the architecture, which 
supports a nicely modular chain of filters.
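To make that concrete, here is a minimal sketch of such a chain, with a
dictionary decompounder running after lower-casing. The wiring and the toy
dictionary are just for illustration; the component classes are the stock
Lucene ones:

{code:java}
import java.util.Arrays;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class DecompoundChainSketch {
  public static void main(String[] args) {
    // Toy dictionary; a real decompounder would load a full word list.
    CharArraySet dict = new CharArraySet(Arrays.asList("ausser", "stand"), true);

    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        // Normalization (lower-casing) must run before the decompounder,
        // since the dictionary lookup assumes lower-cased terms. This is
        // why the decompounder can't simply be a Tokenizer.
        TokenStream result = new LowerCaseFilter(source);
        result = new DictionaryCompoundWordTokenFilter(result, dict);
        return new TokenStreamComponents(source, result);
      }
    };
    // analyzer is now ready to wire into an IndexWriterConfig, etc.
  }
}
{code}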

I suppose there is potential maintenance pain. First though, note that the 
patch I posted here does not actually implement this for all existing 
tokenfilters and tokenizers. It merely demonstrates the approach (and changes 
the implementation, but not the behavior of char filters). We can merge this 
patch and everything will work just as it did before. Once we actually start 
using the API and downstream consumers rely on the offsets being correct-able, 
then there would be some expectation of maintaining that. Let me work those 
changes into the patch for at least ICUFoldingFilter so we can see how 
burdensome that would be. I'll also note that the effect of failing to maintain 
it would simply be that token-splitting tokenfilters generate broken offsets, as 
they do today, just differently broken.

> Enable TokenFilters to assign offsets when splitting tokens
> -----------------------------------------------------------
>
>                 Key: LUCENE-8450
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8450
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Major
>         Attachments: offsets.patch
>
>
> CharFilters and TokenFilters may alter token lengths, meaning that subsequent 
> filters cannot perform simple arithmetic to calculate the original 
> ("correct") offset of a character in the interior of the token. A similar 
> situation exists for Tokenizers, but these can call 
> CharFilter.correctOffset() to map offsets back to their original location in 
> the input stream. There is no such API for TokenFilters.
>
> This issue calls for adding an API to support use cases like highlighting the 
> correct portion of a compound token. For example, the German word 
> "außerstand" (meaning, afaict, "unable to do something") will be decompounded 
> and match "stand" and "ausser", but as things are today, offsets are always 
> set using the start and end of the tokens produced by the Tokenizer, meaning 
> that highlighters will match the entire compound.
> I'm proposing to add this method to {{TokenStream}}:
> {{public CharOffsetMap getCharOffsetMap();}}
> referencing a CharOffsetMap with these methods:
> {{int correctOffset(int currentOff);}}
> {{int uncorrectOffset(int originalOff);}}
>  
> The uncorrectOffset method is a pseudo-inverse of correctOffset, mapping from 
> original offset forward to the current "offset space".
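To make the proposal concrete, here is a rough sketch of how a token-splitting
filter might use the proposed API. {{CharOffsetMap}} and
{{TokenStream.getCharOffsetMap()}} are the additions described above; the
filter itself and its {{findSplitPoint}} helper are invented purely for
illustration:

{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Hypothetical filter that keeps only the first part of each split token.
public final class SketchSplittingFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  public SketchSplittingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    int split = findSplitPoint(termAtt.buffer(), termAtt.length());
    if (split > 0 && split < termAtt.length()) {
      CharOffsetMap map = getCharOffsetMap(); // proposed TokenStream method
      // OffsetAttribute holds *original* offsets. Map the token start forward
      // into the current offset space, advance to the split point there, and
      // map back so the emitted part points at the right slice of the input.
      int curStart = map.uncorrectOffset(offsetAtt.startOffset());
      offsetAtt.setOffset(map.correctOffset(curStart),
                          map.correctOffset(curStart + split));
      termAtt.setLength(split);
      // Buffering and emitting the remainder as a second token is elided.
    }
    return true;
  }

  // Placeholder split heuristic; a real decompounder would consult a dictionary.
  private int findSplitPoint(char[] buffer, int length) {
    return length / 2;
  }
}
{code}

The interesting part is the round trip through {{uncorrectOffset}} and
{{correctOffset}}: simple arithmetic on the stored offsets would be wrong
whenever an upstream CharFilter or TokenFilter has changed the token's length.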


