[
https://issues.apache.org/jira/browse/LUCENE-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576353#comment-16576353
]
Michael McCandless commented on LUCENE-8450:
--------------------------------------------
{quote}Separately I don't like the correctOffset() method that we already have
on tokenizer today. maybe it could be in the offsetattributeimpl or similar
instead.
{quote}
I like that idea – it always seemed weird that the {{correctOffset}} was only
available via {{CharFilter}} whereas {{OffsetAttribute}} is really the more
natural place for it.
> Enable TokenFilters to assign offsets when splitting tokens
> -----------------------------------------------------------
>
> Key: LUCENE-8450
> URL: https://issues.apache.org/jira/browse/LUCENE-8450
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Mike Sokolov
> Priority: Major
> Attachments: offsets.patch
>
>
> CharFilters and TokenFilters may alter token lengths, meaning that subsequent
> filters cannot perform simple arithmetic to calculate the original
> ("correct") offset of a character in the interior of the token. A similar
> situation exists for Tokenizers, but these can call
> CharFilter.correctOffset() to map offsets back to their original location in
> the input stream. There is no such API for TokenFilters.
> This issue calls for adding an API to support use cases like highlighting the
> correct portion of a compound token. For example, the German word
> "außerstand" (meaning, as far as I can tell, "unable to do something") will
> be decompounded and match "stand" and "ausser", but as things are today,
> offsets are always set using the start and end of the tokens produced by the
> Tokenizer, meaning that highlighters will match the entire compound.
> I'm proposing to add this method to {{TokenStream}}:
> {{ public CharOffsetMap getCharOffsetMap();}}
> referencing a CharOffsetMap with these methods:
> {{ int correctOffset(int currentOff);}}
> {{ int uncorrectOffset(int originalOff);}}
>
> The uncorrectOffset method is a pseudo-inverse of correctOffset, mapping from
> original offset forward to the current "offset space".
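
To make the two proposed methods concrete, here is a minimal sketch of how a CharOffsetMap might behave for the "außerstand" example, assuming an upstream filter has rewritten "ß" as "ss". Only the two method signatures come from the issue description. The class SimpleCharOffsetMap, its addCorrection method, and the list-of-corrections layout are hypothetical illustrations and are not taken from Lucene or from the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetMapSketch {

    /** The two methods proposed in the issue description. */
    public interface CharOffsetMap {
        int correctOffset(int currentOff);
        int uncorrectOffset(int originalOff);
    }

    /** Hypothetical implementation, for illustration only. */
    public static final class SimpleCharOffsetMap implements CharOffsetMap {
        // Each entry is {currentOffset, cumulativeDiff}: from currentOffset
        // onward, original = current + cumulativeDiff. Entries must be added
        // in increasing order of currentOffset.
        private final List<int[]> corrections = new ArrayList<>();

        public void addCorrection(int currentOff, int cumulativeDiff) {
            corrections.add(new int[] {currentOff, cumulativeDiff});
        }

        @Override
        public int correctOffset(int currentOff) {
            int diff = 0;
            for (int[] c : corrections) {
                if (c[0] > currentOff) {
                    break;
                }
                diff = c[1];
            }
            return currentOff + diff;
        }

        @Override
        public int uncorrectOffset(int originalOff) {
            int diff = 0;
            for (int[] c : corrections) {
                // c[0] + c[1] is the original offset where this entry starts.
                if (c[0] + c[1] > originalOff) {
                    break;
                }
                diff = c[1];
            }
            return originalOff - diff;
        }
    }

    public static void main(String[] args) {
        // original: a u ß e r s t a n d    (length 10)
        // current:  a u s s e r s t a n d  (length 11)
        SimpleCharOffsetMap map = new SimpleCharOffsetMap();
        map.addCorrection(3, -1); // from current offset 3 on, original = current - 1

        // "stand" occupies [6, 11) in the current text.
        System.out.println(map.correctOffset(6));   // start in the original text
        System.out.println(map.correctOffset(11));  // end in the original text
        System.out.println(map.uncorrectOffset(5)); // back to the current text
    }
}
```

With this mapping, a decompounding filter could report offsets 5 to 10 for "stand" instead of the offsets of the whole compound. Where several current offsets collapse onto one original offset (the two characters of "ss" here), uncorrectOffset has to pick one of them, which is why it can only be a pseudo-inverse.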