Thanks Robert, Uwe - all this is enlightening. I didn't know about those
things you mentioned.

Dawid

On Sat, Nov 11, 2023 at 2:02 PM Uwe Schindler <u...@thetaphi.de> wrote:

> Hi Dawid,
>
> the ASCII folding filter is meant to remove accents. What you want is
> matching of visually similar characters. These are two different
> things.
>
> Robert also mentioned some config options. What I generally use for
> Western European searches, where some documents may contain names of people
> (author names, titles in Cyrillic or other languages), is to convert the
> tokens using ICU transliteration (use one of the ICU folding filters with
> the config below):
>
> Transliterator.getInstance(
>     "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFKC; CaseFold",
>     Transliterator.FORWARD);
>
> This converts everything to Latin characters in a language-neutral way
> and then removes all accents with the trick "decompose, remove non-spacing
> marks, compose again, and case-fold the result".
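
The accent-stripping half of the rule above (decompose, remove non-spacing marks, recompose, fold case) can be sketched with the JDK alone. This is only an illustration under assumptions: it uses java.text.Normalizer and a regex instead of ICU, it leaves out the Any-Latin step entirely, and toLowerCase(Locale.ROOT) is a rough stand-in for ICU's CaseFold. The names AccentFolding and fold are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not part of Lucene or ICU.
final class AccentFolding {

    // NFD -> drop non-spacing marks -> NFKC -> lowercase.
    // Assumption: lowercase with Locale.ROOT approximates case folding.
    static String fold(String input) {
        // Decompose: e.g. U+00E8 becomes 'e' followed by U+0300.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Remove every code point in general category Mn (non-spacing mark).
        String stripped = decomposed.replaceAll("\\p{Mn}+", "");
        // Recompose, also applying compatibility mappings.
        String recomposed = Normalizer.normalize(stripped, Normalizer.Form.NFKC);
        return recomposed.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "Creme Brulee" written with accents, as Unicode escapes.
        System.out.println(fold("Cr\u00e8me Br\u00fbl\u00e9e")); // creme brulee
        // "th" + CYRILLIC SMALL LETTER IE (U+0435) is not turned into "the".
        System.out.println(fold("th\u0435").equals("the"));      // false
    }
}
```

Note that the Cyrillic letter is left alone, since it has no decomposition. Mapping it to a Latin letter is what the Any-Latin step in the ICU rule is for.
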
>
> Uwe
> On 10.11.2023 at 19:03, Dawid Weiss wrote:
>
>
> Hi Steve, Chris,
>
> Ok, makes sense. Thanks for the pointers. I agree the justification for
> the use of character-level normalization filters is highly
> context-dependent (for example, unsuitable when mixed languages are present
> on input).
>
> Dawid
>
> On Fri, Nov 10, 2023 at 6:58 PM Chris Hostetter <hossman_luc...@fucit.org>
> wrote:
>
>>
>> : Here's the unicode letter after "th":
>> : https://www.fileformat.info/info/unicode/char/0435/index.htm
>> :
>> : To my surprise, I couldn't find it in the ascii folding filter:
>> :
>> :
>> https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java
>> :
>> : Does anybody remember whether the omission of Cyrillic characters was
>> : intentional? (There are quite a few of them that are nearly identical in
>> : appearance to Latin letters.)
>>
>> From the javadocs, i'm going to guess it's because the filter focuses
>> on "Latin_characters_in_Unicode" ... and your "CYRILLIC SMALL LETTER IE"
>> isn't described as being a "(adjective) LATIN noun (WITH noun)" like all
>> of the other characters that are considered to have a direct mapping to
>> the "ASCII" / latin characters.
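
To make the naming distinction above concrete, the JDK can report a code point's Unicode name and script. This is just a sketch of the distinction and says nothing about how ASCIIFoldingFilter is implemented internally; the names ScriptCheck and isLatin are made up for illustration.

```java
// Hypothetical helper, for illustration only.
final class ScriptCheck {

    // True if the code point belongs to the Latin script per the JDK's Unicode data.
    static boolean isLatin(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.LATIN;
    }

    public static void main(String[] args) {
        int[] codePoints = {
            0x0065, // LATIN SMALL LETTER E
            0x00E9, // LATIN SMALL LETTER E WITH ACUTE
            0x0435  // CYRILLIC SMALL LETTER IE
        };
        for (int cp : codePoints) {
            // Prints the code point, its Unicode name, and whether it is Latin.
            System.out.printf("U+%04X %s latin=%b%n",
                    cp, Character.getName(cp), isLatin(cp));
        }
    }
}
```

The first two code points are reported as Latin and the third as Cyrillic, even though all three look alike in many fonts.
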
>>
>> If you look back at when it was added...
>>
>> https://issues.apache.org/jira/browse/LUCENE-1390
>>
>> ...the original focus was on deprecating "ISOLatin1AccentFilter" and
>> replacing it with "a more comprehensive version of this code that
>> included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and
>> Latin Extended A unicode blocks."  (The originally proposed name was
>> 'ISOLatinAccentFilter') ... subsequent discussion focused on adding more
>> Latin blocks.
>>
>> There was a related issue at the time which initially aimed to add a
>> more general "UnicodeNormalizationFilter" that ultimately resulted in
>> adding the "ICU" analysis classes...
>>
>> https://issues.apache.org/jira/browse/LUCENE-1343
>>
>> ...which IIUC may better handle "CYRILLIC SMALL LETTER IE" (but i haven't
>> tested that)
>>
>>
>>
>> -Hoss
>> http://www.lucidworks.com/
>>
>>
>> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
