[
https://issues.apache.org/jira/browse/LUCENE-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585134#comment-16585134
]
Mike Sokolov commented on LUCENE-8450:
--------------------------------------
I was trying to think about how to make progress here, but I am still hung up
on the "bug" Uwe pointed out:
{quote}... In general decompounders should always be directly after the
tokenizer (some of them may need to lowercase currently to process the token
like dictionary based decompounders, but that's a bug, IMHO).
{quote}
I then thought that maybe lowercasing could be handled by a CharFilter, but
some token filters will want access to the original case, so it's a conundrum.
If you have a sequence of text processors you really do want to be able to
choose their order flexibly; imposing hard structural constraints like this
seems to create more problems that will be difficult to solve. Do we
have ideas about how to handle this case-folding in decompounders in this
scheme? Would we just fold the case-folding into the decompounder? What about
other character normalization like ß => "ss", ligatures, accent-folding and so
on? Does the decompounder implement all that internally? Do we force that to
happen in a CharFilter if you want to use a decompounder?
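To make the "fold the case-folding into the decompounder" option concrete, here is a minimal, self-contained sketch (not actual Lucene code; the {{DecompoundSketch}} class, its greedy matching strategy, and the tiny dictionary are all illustrative assumptions) of a decompounder that lowercases internally for dictionary lookup while emitting the original-case parts:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class DecompoundSketch {
    // Hypothetical dictionary of lowercased compound parts.
    static final Set<String> DICT = Set.of("außer", "stand");

    // Case-fold internally for dictionary lookup, but emit substrings of
    // the original token so downstream filters still see the original case.
    // Assumes folding is length-preserving for this input (true here; not
    // true in general, which is part of the problem discussed above).
    static List<String> decompound(String token) {
        String folded = token.toLowerCase(Locale.GERMAN);
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (pos < folded.length()) {
            int match = -1;
            // Greedy longest-prefix match against the folded form.
            for (int end = folded.length(); end > pos; end--) {
                if (DICT.contains(folded.substring(pos, end))) {
                    match = end;
                    break;
                }
            }
            if (match < 0) {
                return List.of(token); // no decomposition found; keep whole token
            }
            parts.add(token.substring(pos, match));
            pos = match;
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(decompound("Außerstand")); // [Außer, stand]
    }
}
```

This sidesteps ordering relative to a LowerCaseFilter, but it silently assumes that folding never changes string length, which breaks exactly in the ß => "ss" and ligature cases raised above.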
> Enable TokenFilters to assign offsets when splitting tokens
> -----------------------------------------------------------
>
> Key: LUCENE-8450
> URL: https://issues.apache.org/jira/browse/LUCENE-8450
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Mike Sokolov
> Priority: Major
> Attachments: offsets.patch
>
>
> CharFilters and TokenFilters may alter token lengths, meaning that subsequent
> filters cannot perform simple arithmetic to calculate the original
> ("correct") offset of a character in the interior of the token. A similar
> situation exists for Tokenizers, but these can call
> CharFilter.correctOffset() to map offsets back to their original location in
> the input stream. There is no such API for TokenFilters.
> This issue calls for adding an API to support use cases like highlighting the
> correct portion of a compound token. For example, the German word
> "außerstand" (meaning, afaict, "unable to do something") will be decompounded
> and match "stand" and "ausser", but as things are today, offsets are always
> set using the start and end of the tokens produced by the Tokenizer, meaning
> that highlighters will match the entire compound.
> I'm proposing to add this method to `TokenStream`:
> {{ public CharOffsetMap getCharOffsetMap();}}
> referencing a CharOffsetMap with these methods:
> {{ int correctOffset(int currentOff);}}
> {{ int uncorrectOffset(int originalOff);}}
>
> The uncorrectOffset method is a pseudo-inverse of correctOffset, mapping from
> original offset forward to the current "offset space".
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]