[ https://issues.apache.org/jira/browse/LUCENE-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-5042:
---------------------------------

    Attachment: LUCENE-5042.patch

Thanks for the review, Simon. Here is a new patch that should address your 
concerns. Additionally:

 * it also fixes the other (edge) n-gram tokenizers and filters

 * I factored out some methods into CharacterUtils

 * I ran into a bug: Character.codePointAt(char[], int) doesn't know where the 
valid content of the char[] ends, which can be a problem when working with 
buffers that are not fully filled. So I made this API forbidden and fixed the 
other places that relied on it; codePointAt(char[], int, int) looks safer to 
me (see the sketch after this list).

 * I changed the CharacterUtils.fill API so that it reads fully (which it 
didn't do, although the documentation stated it did); a rough sketch of the 
read-fully behavior also follows below.
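To illustrate the codePointAt pitfall, here is a minimal, self-contained sketch (the class name and buffer contents are made up for illustration): the two-argument overload assumes the whole array is valid, so a stale char beyond the filled region can get paired with a trailing high surrogate, while the three-argument overload respects an explicit limit and returns a detectable unpaired surrogate instead.

public class CodePointAtPitfall {
  public static void main(String[] args) {
    char[] buffer = new char[4];
    buffer[0] = '\uD801'; // high surrogate, sits at the end of the *valid* data
    buffer[1] = '\uDC00'; // stale low surrogate left over from a previous fill
    int validLength = 1;  // only buffer[0] holds fresh data

    // The two-argument overload has no idea where the valid data ends:
    // it happily pairs the high surrogate with the stale char after it.
    int unsafe = Character.codePointAt(buffer, 0);
    System.out.println(Integer.toHexString(unsafe)); // "10400" -- wrong, built from stale data

    // The three-argument overload stops at the limit and returns the
    // unpaired high surrogate itself, which the caller can detect.
    int safe = Character.codePointAt(buffer, 0, validLength);
    System.out.println(Integer.toHexString(safe));   // "d801" -- unpaired surrogate, detectable
  }
}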
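And a rough sketch (not the actual patch) of the read-fully semantics the fill API should have: loop on Reader.read until the buffer is full or the stream ends, since a single read may legally return fewer chars than requested.

import java.io.IOException;
import java.io.Reader;

final class FillSketch {
  /** Fills buffer[0..buffer.length) as far as possible; returns the number of chars read. */
  static int readFully(Reader reader, char[] buffer) throws IOException {
    int offset = 0;
    while (offset < buffer.length) {
      int read = reader.read(buffer, offset, buffer.length - offset);
      if (read == -1) {
        break; // end of stream before the buffer was full
      }
      offset += read;
    }
    return offset;
  }
}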
                
> Improve NGramTokenizer
> ----------------------
>
>                 Key: LUCENE-5042
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5042
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>             Fix For: 5.0, 4.4
>
>         Attachments: LUCENE-5042.patch, LUCENE-5042.patch
>
>
> Now that we fixed NGramTokenizer and NGramTokenFilter to not produce corrupt 
> token streams, the only way to have "true" offsets for n-grams is to use the 
> tokenizer (the filter emits the offsets of the original token).
> Yet, our NGramTokenizer has a few flaws, in particular:
>  - it doesn't have the ability to pre-tokenize the input stream, for example 
> on whitespace,
>  - it doesn't play nice with surrogate pairs.
> Since we already broke backward compatibility for it in 4.4, I'd like to also 
> fix these issues before we release.
