[
https://issues.apache.org/jira/browse/LUCENE-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13653011#comment-13653011
]
Adrien Grand commented on LUCENE-3907:
--------------------------------------
bq. "previous behavior" (incremented position) is simply NOT linked to front
vs. back. I'm not sure why you are claiming that it is!
Indeed, these issues are unrelated, and backward n-graming doesn't cause
highlighting issues. Sorry if I seemed to imply the opposite; that was not
intentional.
My main motivation was to fix the positions/offsets bugs. I also deprecated
support for backward n-graming since there seemed to be lazy consensus: as Uwe
noted, backward n-graming can be obtained by applying ReverseStringFilter, then
EdgeNGramTokenFilter and then ReverseStringFilter again. This helps make
filters simpler, hence easier to understand and to test.
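To illustrate why the reverse / edge n-gram / reverse chain gives the same
terms as back n-graming, here is a minimal sketch on plain strings. It does not
use the Lucene API: the class and method names (BackEdgeNGrams,
frontEdgeNGrams, backEdgeNGrams) are hypothetical, and it works on UTF-16 chars
only, so it ignores positions, offsets and surrogate pairs.

```java
import java.util.ArrayList;
import java.util.List;

public class BackEdgeNGrams {

    // Front edge n-grams: the prefixes of term with length minGram..maxGram.
    static List<String> frontEdgeNGrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int limit = Math.min(maxGram, term.length());
        for (int n = minGram; n <= limit; n++) {
            grams.add(term.substring(0, n));
        }
        return grams;
    }

    // Stand-in for what a reversing filter does to a single term.
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // Back edge n-grams obtained as: reverse, front edge n-grams, reverse again.
    static List<String> backEdgeNGrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (String gram : frontEdgeNGrams(reverse(term), minGram, maxGram)) {
            grams.add(reverse(gram));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Suffixes of "lucene" with length 1..3
        System.out.println(backEdgeNGrams("lucene", 1, 3)); // prints [e, ne, ene]
    }
}
```

Only the front-side logic has to be implemented and tested; the back side
falls out of composing it with a reversal on each end.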
So now, here is how you would use filters depending on whether you want front
or back n-graming and with or without the new positions/offsets.
| | previous positions/offsets (broken) | new positions/offsets |
| front n-graming | EdgeNGramTokenFilter(version=LUCENE_43,side=FRONT) | EdgeNGramTokenFilter(version=LUCENE_44,side=FRONT) |
| back n-graming | EdgeNGramTokenFilter(version=LUCENE_43,side=BACK) | ReverseStringFilter, EdgeNGramTokenFilter(version=LUCENE_44,side=FRONT), ReverseStringFilter |
It is true that the patch prevents users from constructing EdgeNGramTokenFilter
with version>=LUCENE_44 and side=BACK; the intent is to encourage users to
upgrade their analysis chain. But if you think we should allow it, I'm open to
discussion.
> Improve the Edge/NGramTokenizer/Filters
> ---------------------------------------
>
> Key: LUCENE-3907
> URL: https://issues.apache.org/jira/browse/LUCENE-3907
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Adrien Grand
> Labels: gsoc2013
> Fix For: 4.3
>
> Attachments: LUCENE-3907.patch
>
>
> Our ngram tokenizers/filters could use some love. EG, they output ngrams in
> multiple passes, instead of "stacked", which messes up offsets/positions and
> requires too much buffering (can hit OOME for long tokens). They clip at
> 1024 chars (tokenizers) but don't (token filters). They split up surrogate
> pairs incorrectly.