[jira] [Comment Edited] (LUCENE-8516) Make WordDelimiterGraphFilter a Tokenizer

Mike Sokolov (JIRA) Mon, 01 Oct 2018 08:52:19 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634218#comment-16634218
 ]


Mike Sokolov edited comment on LUCENE-8516 at 10/1/18 3:51 PM:
---------------------------------------------------------------

Thanks for the copy/paste, [~romseygeek]. I meant to reply-all, but 
fumble-fingered it on the phone. If we can get TokenFilters to stop messing 
with offsets, then maybe we can kill OffsetAttribute altogether. I feel like 
that is the vision we are groping towards? Either that or support offsets in a 
first-class way in TokenFilters, but nobody seems to want to do that. Do we 
agree those are the choices? In the end, highlighting the entire original 
token (even when your query only really matches a piece of it) doesn't seem so 
terrible. I would advocate for fixing the API problems first by tightening the 
API around offsets. Later, if we want to make it possible to do more precise 
offsets / multiple passes of token splitting, we can maybe find a way to do 
that, but "highlighting a subtoken doesn't work" seems like a relatively minor 
problem, not really deserving of a major effort to support it.
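
To make the "highlight the entire original token" trade-off concrete, here is 
a minimal sketch in plain Java. It does not use Lucene: the Token class, the 
splitKeepingParentOffsets helper, and the highlight helper are all 
hypothetical names made up for illustration, and the splitting rule (break on 
'-' and '_') is a simplifying assumption, not what WordDelimiterGraphFilter 
actually does. The idea shown is only that a splitting step never recomputes 
offsets, so every subtoken reports the span of the token it came from.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetSketch {

    /** Hypothetical stand-in for a term plus its offsets; not a Lucene class. */
    static final class Token {
        final String term;
        final int startOffset; // inclusive index into the original text
        final int endOffset;   // exclusive index into the original text

        Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /**
     * Splits a token on '-' and '_' (an assumed, simplified rule) and leaves
     * offsets alone: every subtoken inherits the parent's start and end.
     */
    static List<Token> splitKeepingParentOffsets(Token parent) {
        List<Token> parts = new ArrayList<>();
        for (String piece : parent.term.split("[-_]+")) {
            if (!piece.isEmpty()) { // skip the empty piece a leading delimiter produces
                parts.add(new Token(piece, parent.startOffset, parent.endOffset));
            }
        }
        return parts;
    }

    /** Wraps the span covered by the token's offsets in [ ] to mimic a highlight. */
    static String highlight(String text, Token match) {
        return text.substring(0, match.startOffset)
                + "[" + text.substring(match.startOffset, match.endOffset) + "]"
                + text.substring(match.endOffset);
    }
}
```

With the text "use wi-fi here" and a parent token "wi-fi" at offsets 4 to 9, a 
match on the subtoken "fi" highlights as "use [wi-fi] here": the whole 
original token lights up, never a span that points at the wrong characters.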


was (Author: sokolov):
Thanks for the copy/paste, [~romseygeek]. I meant to reply-all, but 
fumble-fingered it on the phone. If we can get TokenFilters to stop messing 
with offsets, then maybe we can kill OffsetAttribute altogether. I feel like 
that is the vision we are groping towards? Either that or support offsets in a 
first-class way in TokenFilters, but nobody seems to want to do that. Do we 
agree those are the choices? In the end, highlighting the entire original 
token (even when your query only really matches a piece of it) doesn't seem so 
terrible. I would advocate for fixing the API problems first by tightening the 
API around offsets. Later, if we want to make it possible to do more precise 
offsets / multiple passes of token splitting, we can maybe find a way to do 
that, but the highlight at least seems like a relatively minor problem.

> Make WordDelimiterGraphFilter a Tokenizer
> -----------------------------------------
>
>                 Key: LUCENE-8516
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8516
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8516.patch
>
>
> Being able to split tokens up at arbitrary points in a filter chain, in 
> effect adding a second round of tokenization, can cause any number of 
> problems when trying to keep tokenstreams within their contract.  The most 
> common offender here is the WordDelimiterGraphFilter, which can produce 
> broken offsets in a wide range of situations.
> We should make WDGF a Tokenizer in its own right, which should preserve all 
> the functionality we need, but make reasoning about the resulting tokenstream 
> much simpler.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
