The bug here (in my opinion) is that ThaiWordFilter is a filter at all
(it should be a tokenizer). Like WDF and other filters that really
should be tokenizers, it doesn't expect and can't correctly handle
arbitrary input (e.g. input that has been through a shingle filter...).

Another problem is that offsetsAreCorrect=false allows offsets to
"go backwards" in the stream. But this leniency gives a false sense of
security, because if you then add a shingle filter you end up with a
situation like this one, where startOffset > endOffset.
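
To make that concrete, here is a rough standalone sketch (plain Java, no
Lucene classes) of how a shingle can end up with startOffset > endOffset.
Everything in it is hypothetical: the Tok class, the bigrams() helper and
the two input tokens are made up for illustration, and it assumes a
shingle takes its startOffset from its first token and its endOffset
from its last token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleOffsetSketch {
    /** Hypothetical stand-in for a token: a term plus its character offsets. */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Builds two-token shingles. Assumption: each shingle spans from the
     * startOffset of its first token to the endOffset of its second token.
     */
    static List<Tok> bigrams(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i + 1 < in.size(); i++) {
            Tok a = in.get(i);
            Tok b = in.get(i + 1);
            out.add(new Tok(a.term + " " + b.term, a.start, b.end));
        }
        return out;
    }

    /** The same condition the error message in the failure states. */
    static boolean offsetsValid(Tok t) {
        return t.start >= 0 && t.end >= t.start;
    }

    public static void main(String[] args) {
        // Two individually valid tokens whose offsets "go backwards":
        // the second token points at an earlier part of the text.
        List<Tok> backwards = Arrays.asList(new Tok("b", 5, 8), new Tok("a", 0, 3));
        for (Tok t : bigrams(backwards)) {
            System.out.println(t.term + " startOffset=" + t.start
                + " endOffset=" + t.end + " valid=" + offsetsValid(t));
        }
    }
}
```

Each input token passes the check on its own, but the shingle built from
them has startOffset=5 and endOffset=3, which does not.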

On Fri, Nov 9, 2012 at 7:31 AM, Apache Jenkins Server
<[email protected]> wrote:
> Error Message:
> startOffset must be non-negative, and endOffset must be >= startOffset, 
> startOffset=5,endOffset=3
>
> Stack Trace:
> java.lang.IllegalAr
> [junit4:junit4]   2> Exception from random analyzer:
> [junit4:junit4]   2> charfilters=
> [junit4:junit4]   2> tokenizer=
> [junit4:junit4]   2>   
> org.apache.lucene.analysis.core.WhitespaceTokenizer(LUCENE_50, 
> org.apache.lucene.analysis.core.TestRandomChains$CheckThatYouDidntReadAnythingReaderWrapper@7f4aaa58)
> [junit4:junit4]   2> filters=
> [junit4:junit4]   2>   
> org.apache.lucene.analysis.miscellaneous.LengthFilter(false, 
> org.apache.lucene.analysis.ValidatingTokenFilter@1, -30, 69)
> [junit4:junit4]   2>   
> org.apache.lucene.analysis.shingle.ShingleFilter(org.apache.lucene.analysis.ValidatingTokenFilter@37caea,
>  tpzabzsxye)
> [junit4:junit4]   2>   
> org.apache.lucene.analysis.th.ThaiWordFilter(LUCENE_50, 
> org.apache.lucene.analysis.ValidatingTokenFilter@37caea)
> [junit4:junit4]   2>   
> org.apache.lucene.analysis.shingle.ShingleFilter(org.apache.lucene.analysis.ValidatingTokenFilter@37caea)
> [junit4:junit4]   2> offsetsAreCorrect=false
