I just stumbled upon this stop word appearing in one of our indexes:

thе

Look closely. Can you see it? I doubt it; I couldn't either. Here is the
hex dump (UTF-8) of that token:

74 68 d0 b5

which means

thе and the

are two different things.
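
For anyone who wants to reproduce this, here is a minimal sketch in Python (not what I ran against the index, just an illustration). The helper name `describe` is made up for this example; the suspect token is written with an escape so the odd character stays visible.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' line per character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


suspect = "th\u0435"  # "th" followed by U+0435, the character from the index
plain = "the"         # all-ASCII "the"

# UTF-8 bytes of each token, space-separated like the dump above.
print(suspect.encode("utf-8").hex(" "))  # 74 68 d0 b5
print(plain.encode("utf-8").hex(" "))    # 74 68 65

# Per-character breakdown of the suspect token.
for line in describe(suspect):
    print(line)

# The two strings look the same on screen but do not compare equal.
print(suspect == plain)  # False
```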

Here's the Unicode letter after "th" (U+0435, CYRILLIC SMALL LETTER IE):
https://www.fileformat.info/info/unicode/char/0435/index.htm

To my surprise, I couldn't find it in the ASCII folding filter:

https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java

Does anybody remember whether the omission of Cyrillic characters was
intentional? There are quite a few of them that are nearly identical in
appearance to Latin letters.
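
To make "quite a few" concrete, here is a rough sketch of what folding such look-alikes could look like. Everything in it is an assumption for illustration: the table `CYRILLIC_LOOKALIKES` is hand-picked and incomplete, `fold_lookalikes` is a made-up name, and none of this is taken from ASCIIFoldingFilter.

```python
# Hypothetical, hand-picked table of Cyrillic letters that resemble Latin
# ones. Not part of Lucene and far from complete.
CYRILLIC_LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
    "\u0455": "s",  # CYRILLIC SMALL LETTER DZE
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    "\u0458": "j",  # CYRILLIC SMALL LETTER JE
}
_TABLE = str.maketrans(CYRILLIC_LOOKALIKES)


def fold_lookalikes(token: str) -> str:
    """Fold look-alikes only in tokens that also contain ASCII letters."""
    has_ascii_letter = any(ch.isascii() and ch.isalpha() for ch in token)
    return token.translate(_TABLE) if has_ascii_letter else token


print(fold_lookalikes("th\u0435") == "the")  # True
print(fold_lookalikes("\u043d\u0435\u0442") == "\u043d\u0435\u0442")  # True
```

The sketch leaves purely Cyrillic tokens alone, since folding them blindly would mangle genuine Cyrillic text.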

Dawid
