The bug here (in my opinion) is that ThaiWordFilter is a filter at all; it should be a tokenizer. Like WDF and other filters that really should be tokenizers, it doesn't expect, and can't correctly handle, arbitrary input (e.g. input that has already been through a shingle filter...)
Another problem is that offsetsAreCorrect=false allows offsets to "go backwards" in the stream. But this leniency gives a false sense of security: if you then add a shingle filter, you end up in a situation like this one, where startOffset > endOffset.

On Fri, Nov 9, 2012 at 7:31 AM, Apache Jenkins Server <[email protected]> wrote:
> Error Message:
> startOffset must be non-negative, and endOffset must be >= startOffset,
> startOffset=5,endOffset=3
>
> Stack Trace:
> java.lang.IllegalAr
> [junit4:junit4]   2> Exception from random analyzer:
> [junit4:junit4]   2> charfilters=
> [junit4:junit4]   2> tokenizer=
> [junit4:junit4]   2>
> org.apache.lucene.analysis.core.WhitespaceTokenizer(LUCENE_50,
> org.apache.lucene.analysis.core.TestRandomChains$CheckThatYouDidntReadAnythingReaderWrapper@7f4aaa58)
> [junit4:junit4]   2> filters=
> [junit4:junit4]   2>
> org.apache.lucene.analysis.miscellaneous.LengthFilter(false,
> org.apache.lucene.analysis.ValidatingTokenFilter@1, -30, 69)
> [junit4:junit4]   2>
> org.apache.lucene.analysis.shingle.ShingleFilter(org.apache.lucene.analysis.ValidatingTokenFilter@37caea,
> tpzabzsxye)
> [junit4:junit4]   2>
> org.apache.lucene.analysis.th.ThaiWordFilter(LUCENE_50,
> org.apache.lucene.analysis.ValidatingTokenFilter@37caea)
> [junit4:junit4]   2>
> org.apache.lucene.analysis.shingle.ShingleFilter(org.apache.lucene.analysis.ValidatingTokenFilter@37caea)
> [junit4:junit4]   2> offsetsAreCorrect=false
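
To make the startOffset > endOffset point concrete, here is a rough sketch in plain Java. It is not Lucene code: the Token class and the bigrams() helper are hypothetical stand-ins, and it assumes a shingle takes its startOffset from its first token and its endOffset from its last token. The two input tokens are made up, chosen only so that the result matches the startOffset=5,endOffset=3 from the failure.

```java
import java.util.ArrayList;
import java.util.List;

class ShingleOffsetDemo {

    /** Hypothetical stand-in for a token: its text plus character offsets. */
    static final class Token {
        final String term;
        final int startOffset;
        final int endOffset;

        Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        /** Same condition as in the error message from the failed build. */
        boolean hasValidOffsets() {
            return startOffset >= 0 && endOffset >= startOffset;
        }

        @Override
        public String toString() {
            return term + "[" + startOffset + "," + endOffset + "]";
        }
    }

    /**
     * Simplified bigram shingling (an assumption, not the real ShingleFilter):
     * each shingle spans from the first token's start to the second token's end.
     */
    static List<Token> bigrams(List<Token> tokens) {
        List<Token> shingles = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            Token first = tokens.get(i);
            Token second = tokens.get(i + 1);
            shingles.add(new Token(first.term + " " + second.term,
                                   first.startOffset, second.endOffset));
        }
        return shingles;
    }

    public static void main(String[] args) {
        // Each token is valid on its own, but the second one starts before
        // the first: the offsets "go backwards" in the stream.
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("b", 5, 8));
        tokens.add(new Token("a", 0, 3));

        for (Token shingle : bigrams(tokens)) {
            System.out.println(shingle + " valid=" + shingle.hasValidOffsets());
        }
    }
}
```

Both input tokens pass the offset check individually; it is only the combined shingle that ends up with startOffset=5 and endOffset=3, which is why tolerating backwards offsets upstream doesn't actually protect anything downstream.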
