[ 
https://issues.apache.org/jira/browse/LUCENE-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13653011#comment-13653011
 ] 

Adrien Grand commented on LUCENE-3907:
--------------------------------------

bq. "previous behavior" (incremented position) is simply NOT linked to front 
vs. back. I'm not sure why you are claiming that it is!

Indeed, these issues are unrelated, and backward n-gramming doesn't cause
highlighting issues. Sorry if I seemed to imply the opposite; it was not
intentional.

My main motivation was to fix the positions/offsets bugs. I also deprecated
support for backward n-gramming since there seemed to be lazy consensus: as Uwe
noted, backward n-gramming can be obtained by applying ReverseStringFilter, then
EdgeNGramTokenFilter, and then ReverseStringFilter again. This keeps the
filters simpler, and hence easier to understand and to test.
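
To illustrate why the reverse / front n-gram / reverse chain is equivalent to
back n-gramming, here is a minimal sketch on plain strings. It does not use the
Lucene API: the class and method names (BackEdgeNGramSketch, reverse,
frontEdgeNGrams, backEdgeNGrams) are hypothetical stand-ins for the filters,
and positions/offsets are ignored entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these helpers are NOT Lucene classes.
final class BackEdgeNGramSketch {

    // Stand-in for ReverseStringFilter: reverses a single token.
    // StringBuilder.reverse() keeps surrogate pairs in the right order.
    static String reverse(String token) {
        return new StringBuilder(token).reverse().toString();
    }

    // Stand-in for a front-only edge n-gram filter: emits the prefixes of
    // length minGram..maxGram, counted in code points. Assumes minGram >= 1.
    static List<String> frontEdgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int codePoints = token.codePointCount(0, token.length());
        for (int n = minGram; n <= Math.min(maxGram, codePoints); n++) {
            grams.add(token.substring(0, token.offsetByCodePoints(0, n)));
        }
        return grams;
    }

    // Back n-grams obtained as: reverse, front n-grams, reverse again.
    static List<String> backEdgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (String gram : frontEdgeNGrams(reverse(token), minGram, maxGram)) {
            grams.add(reverse(gram));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(backEdgeNGrams("lucene", 1, 3)); // prints [e, ne, ene]
    }
}
```

In a real analysis chain the same three steps would be three token filters
applied in sequence rather than string helpers.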

So here is how you would now configure the filters, depending on whether you
want front or back n-gramming and whether you want the old or the new
positions/offsets.

| | previous positions/offsets (broken) | new positions/offsets |
| front n-gramming | EdgeNGramTokenFilter(version=LUCENE_43,side=FRONT) | EdgeNGramTokenFilter(version=LUCENE_44,side=FRONT) |
| back n-gramming | EdgeNGramTokenFilter(version=LUCENE_43,side=BACK) | ReverseStringFilter, EdgeNGramTokenFilter(version=LUCENE_44,side=FRONT), ReverseStringFilter |

It is true that the patch prevents users from constructing EdgeNGramTokenFilter
with version>=LUCENE_44 and side=BACK; the intent is to encourage users to
upgrade their analysis chain. But if you think we should allow it, I'm open to
discussion.
                
> Improve the Edge/NGramTokenizer/Filters
> ---------------------------------------
>
>                 Key: LUCENE-3907
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3907
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Adrien Grand
>              Labels: gsoc2013
>             Fix For: 4.3
>
>         Attachments: LUCENE-3907.patch
>
>
> Our ngram tokenizers/filters could use some love.  EG, they output ngrams in 
> multiple passes, instead of "stacked", which messes up offsets/positions and 
> requires too much buffering (can hit OOME for long tokens).  They clip at 
> 1024 chars (tokenizers) but don't (token filters).  They split up surrogate
> pairs incorrectly.

