Hello,
I'm using Lucene 6.2.0 and expect the following test to pass:
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import java.io.IOException;
import java.io.StringReader;
public class TestStandardTokenizer extends BaseTokenStreamTestCase
{
    public void testLongToken() throws IOException
    {
        final StandardTokenizer tokenizer = new StandardTokenizer();
        final int maxTokenLength = tokenizer.getMaxTokenLength();
        // build a string containing "a" repeated maxTokenLength + 5 times, followed by " abc"
        final String longToken = new String(new char[maxTokenLength + 5]).replace("\0", "a") + " abc";
        tokenizer.setReader(new StringReader(longToken));
        assertTokenStreamContents(tokenizer, new String[]{"abc"});
        // fails: the stream actually contains "a" repeated 255 times, then "aaaaa", then "abc"
    }
}
It seems that StandardTokenizer treats a completely filled buffer as a
successfully extracted token (1), and also emits the tail of an over-long
token as a separate token (2). Perhaps (1) is debatable (though I consider it
a bug), but I believe (2) is clearly a bug.
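
To make the splitting easier to reproduce, here is a minimal standalone
sketch (not part of the test above; the class name PrintTokens and the tiny
limit of 4 are just illustrative assumptions). Given the behavior described
above, I would expect it to print "aaaa", "aaaa", "aa", "abc" rather than
only "abc":

import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PrintTokens
{
    public static void main(String[] args) throws Exception
    {
        final StandardTokenizer tokenizer = new StandardTokenizer();
        // hypothetical small limit so the buffer fills quickly; the default is 255
        tokenizer.setMaxTokenLength(4);
        tokenizer.setReader(new StringReader("aaaaaaaaaa abc"));
        final CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken())
        {
            // each call yields one token; an over-long run of "a" comes out
            // as full-buffer tokens plus a separate tail token
            System.out.println(term.toString());
        }
        tokenizer.end();
        tokenizer.close();
    }
}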
Best regards,
Alexey Makeev
[email protected]