[
https://issues.apache.org/jira/browse/LUCENE-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-5042:
---------------------------------
Attachment: LUCENE-5042.patch
Patch:
* Computes n-grams based on Unicode code points instead of Java chars (see the sketch below)
* Adds the ability to split the input stream on certain chars, the way CharTokenizer does
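
The code point change matters whenever the input contains characters outside the BMP, which Java represents as surrogate pairs. A minimal, self-contained sketch (illustrative only, not the attached patch) contrasting char-based and code-point-based bigrams:

{code:java}
// Illustrative only: contrasts naive char-based bigrams with
// code-point-based bigrams on input containing a surrogate pair.
public class CodePointNGramDemo {
  public static void main(String[] args) {
    String s = "a\uD834\uDD1Eb"; // "a𝄞b": 3 code points, but 4 Java chars

    // Char-based bigrams split the surrogate pair \uD834\uDD1E and emit
    // grams that are not valid Unicode strings:
    for (int i = 0; i + 2 <= s.length(); i++) {
      System.out.println("char bigram:       " + s.substring(i, i + 2));
    }

    // Code-point-based bigrams advance by Character.charCount(cp), so
    // every gram contains only whole code points:
    int start = 0;
    int next = start + Character.charCount(s.codePointAt(start));
    while (next < s.length()) {
      int end = next + Character.charCount(s.codePointAt(next));
      System.out.println("code point bigram: " + s.substring(start, end));
      start = next;
      next = end;
    }
  }
}
{code}

The char-based loop prints three grams, two of which contain an unpaired surrogate; the code-point-based loop prints exactly the two valid grams "a𝄞" and "𝄞b".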
> Improve NGramTokenizer
> ----------------------
>
> Key: LUCENE-5042
> URL: https://issues.apache.org/jira/browse/LUCENE-5042
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Fix For: 5.0, 4.4
>
> Attachments: LUCENE-5042.patch
>
>
> Now that we fixed NGramTokenizer and NGramTokenFilter to not produce corrupt
> token streams, the only way to have "true" offsets for n-grams is to use the
> tokenizer (the filter emits the offsets of the original token).
> Yet, our NGramTokenizer has a few flaws, in particular:
> - it doesn't have the ability to pre-tokenize the input stream, for example
> on whitespace (see the splitting sketch after this description),
> - it doesn't play nicely with surrogate pairs.
> Since we already broke backward compatibility for it in 4.4, I'd like to also
> fix these issues before we release.
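
To make the pre-tokenization flaw concrete: when the tokenizer can split on some chars the way CharTokenizer does, n-grams are computed within each block and never across a split character. A hedged sketch of the intended behavior (the bigrams helper is hypothetical, not the API of the attached patch; char-based grams are used here for brevity):

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo of the intended splitting behavior; not the patch's API.
public class PreTokenizedNGramDemo {
  // Computes bigrams within whitespace-separated blocks only.
  static List<String> bigrams(String input) {
    List<String> grams = new ArrayList<>();
    for (String block : input.split("\\s+")) { // split like CharTokenizer would
      for (int i = 0; i + 2 <= block.length(); i++) {
        grams.add(block.substring(i, i + 2));
      }
    }
    return grams;
  }

  public static void main(String[] args) {
    // Prints [ab, cd, de]: the gram "b c" never appears because the space
    // terminates the first block.
    System.out.println(bigrams("ab cde"));
  }
}
{code}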