[
https://issues.apache.org/jira/browse/LUCENE-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16574854#comment-16574854
]
Mike Sokolov commented on LUCENE-8450:
--------------------------------------
This approach does not require adding logic to every existing token filter. To
get the benefit, it is only really necessary to change filters that alter the
length of tokens, and there are relatively few of those. As for "tokenizing"
token filters, they basically operate as consumers of corrected offsets, and we
can choose to leave the situation as-is on a case-by-case basis. We can
continue to use the existing full-width offsets for the generated sub-tokens,
simply ignoring this correction API, and fix only the ones we want to.
I agree that the Tokenizer should, in general, be the class responsible for
splitting tokens, but there are reasons why these other filters were
implemented the way they are; I mentioned some regarding
DictionaryCompoundWordTokenFilter. I don't really know about the design of
CJKBigramFilter, but it seems it also relies on ICUTokenizer running
beforehand?
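To make the first point concrete, here is a minimal sketch of the kind of
length-changing filter that would need to participate: a ß -> ss normalizer
that records, per token, how output positions map back to input positions. The
recording scheme (a cumulative-diff table, similar to what {{BaseCharFilter}}
keeps for stream offsets) is my own illustration, not the API in offsets.patch:
{code:java}
// Illustrative only: a filter that changes token *lengths* (here ß -> ss)
// is the kind that must record corrections; filters that leave lengths
// alone have nothing to do.
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class SketchSzNormalizer extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  // cumulative length diff, keyed by position in the *output* term; this
  // per-token table stands in for however the patch records corrections
  private final TreeMap<Integer, Integer> diffs = new TreeMap<>();

  public SketchSzNormalizer(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    diffs.clear();
    final String term = termAtt.toString();
    final StringBuilder out = new StringBuilder();
    int diff = 0;
    for (int i = 0; i < term.length(); i++) {
      final char c = term.charAt(i);
      if (c == 'ß') {
        out.append("ss");
        diff--; // the output is now one char longer than the input
        diffs.put(out.length(), diff);
      } else {
        out.append(c);
      }
    }
    termAtt.setEmpty().append(out);
    return true;
  }

  /** Map a position within the current term back to the original term. */
  public int correctOffset(int currentOff) {
    final Map.Entry<Integer, Integer> e = diffs.floorEntry(currentOff);
    return e == null ? currentOff : currentOff + e.getValue();
  }
}
{code}
For "außerstand" -> "ausserstand", {{correctOffset(6)}} (the start of "stand"
in the normalized term) returns 5, which is exactly what a downstream
decompounder would need in order to assign sub-token offsets.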
> Enable TokenFilters to assign offsets when splitting tokens
> -----------------------------------------------------------
>
> Key: LUCENE-8450
> URL: https://issues.apache.org/jira/browse/LUCENE-8450
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Mike Sokolov
> Priority: Major
> Attachments: offsets.patch
>
>
> CharFilters and TokenFilters may alter token lengths, meaning that subsequent
> filters cannot perform simple arithmetic to calculate the original
> ("correct") offset of a character in the interior of the token. A similar
> situation exists for Tokenizers, but these can call
> CharFilter.correctOffset() to map offsets back to their original location in
> the input stream. There is no such API for TokenFilters.
> This issue calls for adding an API to support use cases like highlighting the
> correct portion of a compound token. For example, the German word
> "außerstand" (meaning, afaict, "unable to do something") will be decompounded
> and match "stand" and "ausser", but as things are today, offsets are always
> set using the start and end of the tokens produced by the Tokenizer, meaning
> that highlighters will match the entire compound.
> I'm proposing to add this method to {{TokenStream}}:
> {{public CharOffsetMap getCharOffsetMap();}}
> referencing a CharOffsetMap with these methods:
> {{int correctOffset(int currentOff);}}
> {{int uncorrectOffset(int originalOff);}}
>
> The uncorrectOffset method is a pseudo-inverse of correctOffset, mapping from
> an original offset forward into the current "offset space".
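> A usage sketch for the "außerstand" example above (the method names come
> from this proposal; the decompounder wiring and positions are illustrative
> assumptions, not code from offsets.patch):
> {code:java}
> // Inside a decompounding TokenFilter's incrementToken(): suppose an
> // upstream filter normalized "außerstand" to "ausserstand", that this is
> // the only correction in play, and that the dictionary matched the
> // sub-token "stand" at positions 6..11 of the current term, which starts
> // at stream offset tokenStart.
> CharOffsetMap map = getCharOffsetMap();
> int origStart = map.correctOffset(tokenStart + 6);  // -> tokenStart + 5
> int origEnd = map.correctOffset(tokenStart + 11);   // -> tokenStart + 10
> offsetAtt.setOffset(origStart, origEnd);            // highlight just "stand"
> {code}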