[
https://issues.apache.org/jira/browse/LUCENE-759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12477115
]
Doron Cohen commented on LUCENE-759:
------------------------------------
I have two comments/questions on the n-gram tokenizers:
(1) It seems that only the first 1024 characters of the input are handled, and the
rest is ignored (and I think that, as a result, the input stream would be left
dangling open).
If you add this test case:
  /**
   * Test that no ngrams are lost, even for really long inputs
   * @throws Exception
   */
  public void testLongerInput() throws Exception {
    int expectedNumTokens = 1024;
    int ngramLength = 2;
    // prepare long string
    StringBuffer sb = new StringBuffer();
    while (sb.length() < expectedNumTokens + ngramLength - 1)
      sb.append('a');
    StringReader longStringReader = new StringReader(sb.toString());
    NGramTokenizer tokenizer = new NGramTokenizer(longStringReader,
        ngramLength, ngramLength);
    int numTokens = 0;
    Token token;
    while ((token = tokenizer.next()) != null) {
      numTokens++;
      assertEquals("aa", token.termText());
    }
    assertEquals("wrong number of tokens", expectedNumTokens, numTokens);
  }
With expectedNumTokens = 1023 it would pass, but any larger number would fail.
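The 1023 threshold is consistent with a 1024-character cutoff: a string of n characters contains n - k + 1 n-grams of length k. A minimal sketch of that arithmetic, independent of the Lucene classes (the class name NGramCount and the method countNGrams are made up for illustration):

```java
public class NGramCount {
    /**
     * Number of contiguous n-grams of length ngramLength in an input of
     * inputLength characters (hypothetical helper, not part of the patch).
     */
    static int countNGrams(int inputLength, int ngramLength) {
        if (ngramLength <= 0) {
            throw new IllegalArgumentException("ngramLength must be positive");
        }
        // Inputs shorter than one n-gram yield no tokens at all.
        return Math.max(0, inputLength - ngramLength + 1);
    }

    public static void main(String[] args) {
        // The test builds expectedNumTokens + ngramLength - 1 = 1025 characters.
        System.out.println(countNGrams(1025, 2)); // prints 1024
        // If only the first 1024 characters were consumed, we would get:
        System.out.println(countNGrams(1024, 2)); // prints 1023
    }
}
```

So an input of 1024 characters (expectedNumTokens = 1023) still fits entirely, while anything longer would lose tokens if the tokenizer stops after 1024 characters.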
(2) It seems safer to read the characters like this:
  int n = input.read(chars);
  inStr = new String(chars, 0, n);
(This way we do not rely on String.trim(), which does work, but worries me.)
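Both points could perhaps be addressed together by reading until the Reader reports end of input rather than with a single read() call. The following is only a sketch, not the code in the patch: the class name ReadFully, the method readFully, and the 1024-character buffer size are assumptions made for illustration, and closing the Reader is left to the caller.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReadFully {
    /**
     * Reads all remaining characters from input (hypothetical helper).
     * Reader.read(char[]) returns the number of characters read, or -1
     * at end of stream, so only the filled part of the buffer is used.
     */
    static String readFully(Reader input) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] chars = new char[1024]; // buffer size is an arbitrary choice
        int n;
        while ((n = input.read(chars)) != -1) {
            sb.append(chars, 0, n); // append only the n characters actually read
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        StringBuilder longInput = new StringBuilder();
        for (int i = 0; i < 3000; i++) {
            longInput.append('a');
        }
        String result = readFully(new StringReader(longInput.toString()));
        System.out.println(result.length()); // prints 3000
    }
}
```

Because the count returned by read() is used directly, no trimming is needed, and leading or trailing whitespace in the input is preserved. Note that a single read() can also return -1 on an empty input, in which case new String(chars, 0, n) would throw, so the loop form handles that case as well.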
> Add n-gram tokenizers to contrib/analyzers
> ------------------------------------------
>
> Key: LUCENE-759
> URL: https://issues.apache.org/jira/browse/LUCENE-759
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Reporter: Otis Gospodnetic
> Assigned To: Otis Gospodnetic
> Priority: Minor
> Fix For: 2.2
>
> Attachments: LUCENE-759.patch, LUCENE-759.patch, LUCENE-759.patch
>
>
> It would be nice to have some n-gram-capable tokenizers in contrib/analyzers.
> Patch coming shortly.