We have our own Analyzer which has the following
Public final TokenStream tokenStream(String fieldname, Reader reader) { TokenStream result = new StandardTokenizer(reader); result = new StandardFilter(result); result = new MyAccentFilter(result); result = new LowerCaseFilter(result); result = new StopFilter(result); return result; } In 2.3.2 if the token ‘Cómo’ came through this it would get changed to ‘como’ by the time it made it through the filters. In 2.4.0 this isn’t the case. It treats this one token as two so we get ‘co’ and ‘mo’. So instead of search ‘como’ or ‘Cómo’ to get all the hits we now have to do them separately. I switched to the WhitespaceTokenizer as a test and that is indexing and searching the way we expect it, but I haven’t looked into what we lost by using that tokenizer. Were we relying on a bug to get what we wanted from StandardTokenizer or did something break in 2.4.0?