[ https://issues.apache.org/jira/browse/LUCENE-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576633#comment-16576633 ]

Robert Muir commented on LUCENE-8450:
-------------------------------------

{quote}
To get the benefit, it's only really necessary to change filters that change 
the length of tokens, and there are pretty few of these
{quote}

Sorry, I think this is wrong. For example, most language stemmers do simple 
suffix or prefix stripping of tokens, so they definitely change the length. 
These are hard enough already; why should they have to deal with this 
artificial problem?
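
Just to make that concrete, here is a trivial sketch (whitespace tokenizer plus 
the Porter stemmer, nothing specific to this patch): the term shrinks, but the 
offsets keep pointing at the original text, so any "startOffset + position 
within the term" arithmetic in a later filter is already wrong.

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class StemmedOffsetsDemo {
  public static void main(String[] args) throws Exception {
    Tokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("running"));
    try (TokenStream ts = new PorterStemFilter(tokenizer)) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      OffsetAttribute offsets = ts.addAttribute(OffsetAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        // prints "run 0-7": the term is now 3 chars, but the offsets still
        // span the original 7-character "running"
        System.out.println(term + " " + offsets.startOffset() + "-" + offsets.endOffset());
      }
      ts.end();
    }
  }
}
{code}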

In general your problem is caused by "decompounders" that use the wrong base 
class. For Chinese, decomposition is a tokenizer. For Japanese, it's the same. 
For Korean, it's the same. These aren't the buggy ones.

The problem is things like the Decompound* filters geared at languages such as 
German, the WordDelimiterFilter, and so on. These should be fixed. Sorry, this 
is honestly still a tokenization problem: breaking the text into meaningful 
tokens. These should not be TokenFilters; that will fix the issue.

Maybe it makes sense for something like StandardTokenizer to offer a 
"decompound hook" or something very limited (e.g., not a chain, just one thing) 
so that European-language decompounders don't need to duplicate a lot of the 
logic around punctuation and Unicode. Perhaps the same functionality could be 
used for "word delimiter" so that people can have a "unicode standard" 
tokenizer that just handles some ambiguous cases differently (such as when the 
case changes or when there are hyphens, etc.). I think Lucene is weak here, and 
I don't think we should cancel the issue, but at the same time I don't think we 
should give TokenFilter "tokenizer" capabilities just for artificial code-reuse 
purposes: the abstractions need to make sense so that we can prevent and detect 
bugs, do a good job testing, and maintain all the code.
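
Roughly what I mean, as a sketch only (the DecompoundHook name and signature 
below are made up; nothing like this exists today):

{code:java}
/**
 * Hypothetical hook that something like StandardTokenizer could invoke for
 * each token it produces. The tokenizer keeps ownership of punctuation,
 * Unicode segmentation and offsets; the hook only proposes sub-token
 * boundaries, so a German decompounder or a word-delimiter style splitter
 * doesn't have to re-implement any of that.
 */
public interface DecompoundHook {
  /**
   * @param buffer characters of the token just produced by the tokenizer
   * @param length number of valid characters in the buffer
   * @return sub-token boundaries as {start, end} pairs relative to the token,
   *         or null to emit the token unchanged
   */
  int[][] decompound(char[] buffer, int length);
}
{code}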

> Enable TokenFilters to assign offsets when splitting tokens
> -----------------------------------------------------------
>
>                 Key: LUCENE-8450
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8450
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Major
>         Attachments: offsets.patch
>
>
> CharFilters and TokenFilters may alter token lengths, meaning that subsequent 
> filters cannot perform simple arithmetic to calculate the original 
> ("correct") offset of a character in the interior of the token. A similar 
> situation exists for Tokenizers, but these can call 
> CharFilter.correctOffset() to map offsets back to their original location in 
> the input stream. There is no such API for TokenFilters.
> This issue calls for adding an API to support use cases like highlighting the 
> correct portion of a compound token. For example, the German word 
> "außerstand" (meaning, afaict, "unable to do something") will be decompounded 
> and match "stand" and "ausser", but as things are today, offsets are always 
> set using the start and end of the tokens produced by the Tokenizer, meaning 
> that highlighters will match the entire compound.
> I'm proposing to add this method to {{TokenStream}}:
> {{     public CharOffsetMap getCharOffsetMap();}}
> referencing a CharOffsetMap with these methods:
> {{     int correctOffset(int currentOff);}}
> {{     int uncorrectOffset(int originalOff);}}
>  
> The uncorrectOffset method is a pseudo-inverse of correctOffset, mapping from 
> original offset forward to the current "offset space".
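
For reference, a splitting filter would use the proposed API roughly like the 
sketch below. CharOffsetMap and TokenStream.getCharOffsetMap() come from the 
attached offsets.patch, not from released Lucene, and the dictionary lookup and 
the emission of the second sub-token are elided.

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

/**
 * Sketch of a decompounding filter using the proposed CharOffsetMap
 * (from offsets.patch, not an existing Lucene API).
 */
public final class DecompoundingFilterSketch extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  public DecompoundingFilterSketch(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    int split = findSplitPoint(termAtt); // e.g. 5 for "außer|stand"
    if (split > 0 && split < termAtt.length()) {
      CharOffsetMap map = input.getCharOffsetMap(); // proposed API
      int start = offsetAtt.startOffset();          // offset in the original text
      // step forward into the current "offset space", move past the first part,
      // then map back to the original text:
      int end = map.correctOffset(map.uncorrectOffset(start) + split);
      termAtt.setLength(split);         // keep only the first part ("außer")
      offsetAtt.setOffset(start, end);  // highlighters now see just that part
      // (emitting "stand" as a second token via captureState() is elided here)
    }
    return true;
  }

  private int findSplitPoint(CharSequence term) {
    return -1; // dictionary lookup elided in this sketch
  }
}
{code}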


