Thanks Robert, Uwe - all this is enlightening. I didn't know about those things you mentioned.
Dawid

On Sat, Nov 11, 2023 at 2:02 PM Uwe Schindler <u...@thetaphi.de> wrote:

> Hi Dawid,
>
> the ASCII folding filter is meant to remove accents. You would like to
> have searching for visually similar characters. These are 2 different
> things.
>
> Actually Robert also has some config options; what I generally use for
> western European searches, where some documents may contain names of
> people (author names, titles in Cyrillic or other languages), is to
> convert the tokens using ICU transliteration (use one of the ICU folding
> filters with the below config):
>
> Transliterator.getInstance("Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFKC; CaseFold", Transliterator.FORWARD);
>
> This converts everything to Latin characters in a language-neutral way
> and then removes all accents by the trick "decompose, remove non-spacing
> marks, compose again and case-fold the result".
>
> Uwe
>
> On 10.11.2023 at 19:03, Dawid Weiss wrote:
>
> Hi Steve, Chris,
>
> Ok, makes sense. Thanks for the pointers. I agree the justification for
> the use of character-level normalization filters is highly
> context-dependent (for example, unsuitable when mixed languages are
> present on input).
>
> Dawid
>
> On Fri, Nov 10, 2023 at 6:58 PM Chris Hostetter <hossman_luc...@fucit.org>
> wrote:
>
>> : Here's the unicode letter after "th":
>> : https://www.fileformat.info/info/unicode/char/0435/index.htm
>> :
>> : To my surprise, I couldn't find it in the ascii folding filter:
>> :
>> : https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java
>> :
>> : Does anybody remember whether the omission of Cyrillic characters was
>> : intentional? (There are quite a few of them that are nearly identical
>> : in appearance to Latin letters.)
>>
>> From the javadocs, i'm going to guess it's because the filter focuses
>> on "Latin_characters_in_Unicode" ... and your "CYRILLIC SMALL LETTER IE"
>> isn't described as being a "(adjective) LATIN noun (WITH noun)" like all
>> of the other characters that are considered to have a direct mapping to
>> the "ASCII" / latin characters.
>>
>> If you look back at when it was added...
>>
>> https://issues.apache.org/jira/browse/LUCENE-1390
>>
>> ...the original focus was on deprecating "ISOLatin1AccentFilter" and
>> replacing it with "a more comprehensive version of this code that
>> included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and
>> Latin Extended A unicode blocks." (The originally proposed name was
>> 'ISOLatinAccentFilter') ... subsequent discussion focused on adding more
>> Latin blocks.
>>
>> There was a related issue at the time which initially aimed to add a
>> more general "UnicodeNormalizationFilter" that ultimately resulted in
>> adding the "ICU" analysis classes...
>>
>> https://issues.apache.org/jira/browse/LUCENE-1343
>>
>> ...which IIUC may better handle "CYRILLIC SMALL LETTER IE" (but i haven't
>> tested that)
>>
>> -Hoss
>> http://www.lucidworks.com/
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
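
For illustration, here is a minimal sketch of the "decompose, remove non-spacing marks, compose again and case-fold" trick from Uwe's message, written against the JDK's java.text.Normalizer only. Several things here are assumptions and not from the thread: the class name AccentFold and the method fold are hypothetical, the "Any-Latin" transliteration step is left out because it needs ICU4J, and toLowerCase(Locale.ROOT) is only a rough stand-in for ICU's CaseFold.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Lucene or ICU.
final class AccentFold {
    private AccentFold() {}

    static String fold(String input) {
        // NFD: split precomposed characters into base letter + combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Drop non-spacing marks (Unicode general category Mn).
        String stripped = decomposed.replaceAll("\\p{Mn}+", "");
        // NFKC: recompose and apply compatibility mappings.
        String composed = Normalizer.normalize(stripped, Normalizer.Form.NFKC);
        // Assumption: lowercasing approximates case folding; the two differ
        // for some characters.
        return composed.toLowerCase(Locale.ROOT);
    }
}
```

With this sketch, AccentFold.fold("Émile") gives "emile", while CYRILLIC SMALL LETTER IE (U+0435) comes back unchanged. That is consistent with the point made in the thread: removing accents and matching visually similar characters are two different things, and the latter needs a transliteration step such as the ICU one quoted above.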