2.3.2 -> 2.4.0 StandardTokenizer issue

Philip Puffinburger Mon, 16 Feb 2009 16:19:50 -0800

We have our own Analyzer which has the following


Public final TokenStream tokenStream(String fieldname, Reader reader) {

  TokenStream result = new StandardTokenizer(reader);

  result = new StandardFilter(result);

  result = new MyAccentFilter(result);

  result = new LowerCaseFilter(result);

  result = new StopFilter(result);

 

  return result;

}

 

In 2.3.2 if the token ‘Cómo’ came through this it would get changed to
‘como’ by the time it made it through the filters.    In 2.4.0 this isn’t
the case.   It treats this one token as two so we get ‘co’ and ‘mo’.    So
instead of search ‘como’ or ‘Cómo’ to get all the hits we now have to do
them separately.

 

I switched to the WhitespaceTokenizer as a test and that is indexing and
searching the way we expect it, but I haven’t looked into what we lost by
using that tokenizer.

 

Were we relying on a bug to get what we wanted from StandardTokenizer or did
something break in 2.4.0?

2.3.2 -> 2.4.0 StandardTokenizer issue

Reply via email to