[ https://issues.apache.org/jira/browse/LUCENE-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16577055#comment-16577055 ]

Robert Muir commented on LUCENE-8450:
-------------------------------------

{quote}
Actually that's the real solution for the decompounding or WordDelimiterFilter. 
Actually all tokenizers should support it. Maybe that can be done in the base 
class, with incrementToken() becoming final. Instead, the parsing code could 
push tokens that are passed to the decompounder, and then incrementToken 
returns them. So incrementToken is final, calls some next method on the 
tokenization, and passes the result to the decompounder, which is a no-op by 
default.
{quote}
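For concreteness, here is a rough sketch of the shape being described; all 
names (DecompoundingTokenizer, Decompounder, nextInternal) are made up for 
illustration and are not an existing API:

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;

/** Sketch only: a tokenizer base class where incrementToken() is final. */
abstract class DecompoundingTokenizer extends Tokenizer {

  /** Hook applied to every raw token; the default is a no-op. */
  interface Decompounder {
    Decompounder IDENTITY = stream -> {};
    void process(TokenStream current) throws IOException;
  }

  private final Decompounder decompounder;

  protected DecompoundingTokenizer(Decompounder decompounder) {
    this.decompounder = decompounder;
  }

  @Override
  public final boolean incrementToken() throws IOException {
    // Subclasses implement nextInternal() instead of incrementToken();
    // each raw token is routed through the decompounder before being
    // returned, so decompounding becomes a property of the base class.
    if (!nextInternal()) {
      return false;
    }
    decompounder.process(this);
    return true;
  }

  /** Parsing code goes here: advance to the next raw token, or return false. */
  protected abstract boolean nextInternal() throws IOException;
}
{code}

Note that in this shape process() only ever sees the current token, which is 
exactly the limitation discussed next.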

If we take this simplistic approach, does it mean the decompounder only sees a 
single token (versus, say, an entire sentence or phrase)? This might only work 
for "easy" decompounding algorithms like WDF and the German Decompounding* 
implementations. Maybe it is possible to refactor ThaiTokenizer to this and it 
will also be fine? Currently that one gets the context of the whole sentence 
(but I am not sure it needs that / whether losing it impacts the current 
underlying algorithm). 

But I think the Chinese, Japanese, and Korean tokenizers use more context than 
just one whitespace/punctuation-delimited word (n-gram features and so on in 
the model). So it's good just to think things through a bit; it would be great 
to consolidate a lot of this if we can. 

At the same time I think it's OK to make the API limited, you know, if we 
think that will help real use cases today. So we could just document that if 
you use this decompounder interface you only see individual delimited tokens 
and not some bigger context. I'm hoping we can avoid situations where the 
algorithm has to capture/restore a bunch of state: if we end up with that, 
then things haven't really gotten any better.


> Enable TokenFilters to assign offsets when splitting tokens
> -----------------------------------------------------------
>
>                 Key: LUCENE-8450
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8450
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Major
>         Attachments: offsets.patch
>
>
> CharFilters and TokenFilters may alter token lengths, meaning that subsequent 
> filters cannot perform simple arithmetic to calculate the original 
> ("correct") offset of a character in the interior of the token. A similar 
> situation exists for Tokenizers, but these can call 
> CharFilter.correctOffset() to map offsets back to their original location in 
> the input stream. There is no such API for TokenFilters.
> This issue calls for adding an API to support use cases like highlighting the 
> correct portion of a compound token. For example, the German word 
> "außerstand" (meaning, afaict, "unable to do something") will be decompounded 
> and match "stand" and "ausser", but as things are today, offsets are always 
> set using the start and end of the tokens produced by the Tokenizer, meaning 
> that highlighters will match the entire compound.
> I'm proposing to add this method to {{TokenStream}}:
> {{    public CharOffsetMap getCharOffsetMap();}}
> referencing a CharOffsetMap with these methods:
> {{    int correctOffset(int currentOff);}}
> {{    int uncorrectOffset(int originalOff);}}
>
> The uncorrectOffset method is a pseudo-inverse of correctOffset, mapping from 
> original offset forward to the current "offset space".
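To make the proposal concrete, here is a minimal sketch of how a 
token-splitting filter might use the two methods, assuming the proposed 
getCharOffsetMap() were added to TokenStream. The CharOffsetMap shape is taken 
from the description above; TailSubtokenFilter and its splitting logic are 
invented purely for illustration:

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

/** The proposed mapping between current and original character offsets. */
interface CharOffsetMap {
  int correctOffset(int currentOff);    // current offset -> original offset
  int uncorrectOffset(int originalOff); // original offset -> current offset
}

/** Illustrative filter that keeps only the tail half of each token. */
final class TailSubtokenFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  TailSubtokenFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    int split = termAtt.length() / 2; // placeholder decompounding decision
    // Hypothetical call: getCharOffsetMap() does not exist on TokenStream
    // today; it is the API this issue proposes.
    CharOffsetMap map = input.getCharOffsetMap();
    // startOffset() holds an *original* offset; map it forward into the
    // current "offset space", step over the dropped prefix, then map back
    // so highlighters see the right slice of the original input.
    int currentStart = map.uncorrectOffset(offsetAtt.startOffset());
    int newStart = map.correctOffset(currentStart + split);
    offsetAtt.setOffset(newStart, offsetAtt.endOffset());
    termAtt.copyBuffer(termAtt.buffer(), split, termAtt.length() - split);
    return true;
  }
}
{code}

Without the map, the filter would have to compute newStart as startOffset + 
split, which is exactly the simple arithmetic the description says breaks once 
upstream filters have changed token lengths.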


