Hello,

I'm using Lucene 6.2.0 and expect the following test to pass:

import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.standard.StandardTokenizer;

import java.io.IOException;
import java.io.StringReader;

public class TestStandardTokenizer extends BaseTokenStreamTestCase
{
    public void testLongToken() throws IOException
    {
        final StandardTokenizer tokenizer = new StandardTokenizer();
        final int maxTokenLength = tokenizer.getMaxTokenLength();

        // string with the following contents: a...a (maxTokenLength + 5 times), followed by " abc"
        final String longToken = new String(new char[maxTokenLength + 5]).replace("\0", "a") + " abc";

        tokenizer.setReader(new StringReader(longToken));
        
        assertTokenStreamContents(tokenizer, new String[]{"abc"});
        // actual contents: "a" repeated 255 times (the default maxTokenLength), then "aaaaa", then "abc"
    }
}

It seems that StandardTokenizer treats a completely filled buffer as a 
successfully extracted token (1), and also emits the tail of a too-long token 
as a separate token (2). Maybe (1) is disputable (I think it is a bug), but I 
believe (2) is clearly a bug.
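In case it is useful, here is a possible workaround sketch (untested; it 
assumes the longest token in the input stays below the raised limit): raise 
the tokenizer's max token length so an over-long token is not split, then 
drop over-long tokens with a LengthFilter. The helper name dropLongTokens 
and the 10 * maxLength headroom are just illustrative choices:

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

import java.io.StringReader;

public class LongTokenWorkaround
{
    // Returns a stream that drops tokens longer than maxLength instead of
    // splitting them into a head token and a tail token.
    static TokenStream dropLongTokens(String input, int maxLength)
    {
        final StandardTokenizer tokenizer = new StandardTokenizer();
        // Raise the internal limit so an over-long run stays a single token;
        // 10 * maxLength is an arbitrary headroom assumption, not a magic value.
        tokenizer.setMaxTokenLength(10 * maxLength);
        tokenizer.setReader(new StringReader(input));
        // Now filter out everything longer than the intended limit.
        return new LengthFilter(tokenizer, 1, maxLength);
    }
}

With the test input above, dropLongTokens(longToken, 255) should produce just 
"abc", since the 260-character run stays one token and is then filtered out.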

Best regards,
Alexey Makeev
makeev...@mail.ru
