Re: Ascii folding

Robert Muir Fri, 10 Nov 2023 10:23:22 -0800

Sorry, I meant to provide the demo link too, in case you want to play:
https://util.unicode.org/UnicodeJsps/confusables.jsp?a=paypal&r=None


It illustrates how the problem of "visually confusing" is really its
own beast, e.g. confusion of 'l' vs '1' with some fonts.
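
To make the idea behind that demo a bit more concrete, here is a minimal
Python sketch of comparing strings by a "skeleton". Everything in it is an
assumption for illustration: TOY_CONFUSABLES is a hand-picked four-entry
table, and toy_skeleton / could_be_confused are hypothetical helper names.
It is not the Unicode confusables data and not ICU's algorithm, just the
general shape of the technique.

```python
import unicodedata

# Hypothetical, hand-picked subset of look-alike mappings. The real data
# (Unicode's confusables.txt) is far larger; this is only for illustration.
TOY_CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
}


def toy_skeleton(text: str) -> str:
    """Decompose with NFD, then map each character through the toy table."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(TOY_CONFUSABLES.get(ch, ch) for ch in decomposed)


def could_be_confused(a: str, b: str) -> bool:
    """Two strings are 'confusable' here if their toy skeletons are equal."""
    return toy_skeleton(a) == toy_skeleton(b)


latin = "beer"
mixed = "b\u0435\u0435r"  # both 'e's replaced by U+0435

# Plain equality and Unicode normalization both treat these as different.
print(latin == mixed)
print(unicodedata.normalize("NFKD", mixed) == latin)
# Only the skeleton comparison treats them as the same.
print(could_be_confused(latin, mixed))
```

Note that normalization alone does not help here: U+0435 has no
decomposition, so only an explicit confusables mapping brings the two
strings together.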

On Fri, Nov 10, 2023 at 1:13 PM Robert Muir <rcm...@gmail.com> wrote:
>
> For visually confusing characters we have the option to expose specific
> processing for that, e.g.
> https://unicode-org.github.io/icu-docs/apidoc/dev/icu4j/com/ibm/icu/text/SpoofChecker.html#getSkeleton-java.lang.CharSequence-
>
> Maybe there are use-cases for a search engine, e.g. find me documents
> with words that "could be confused visually" with 'beer' (or whatever
> the query is). Usually this processing is geared around security
> use-cases.
>
> On Fri, Nov 10, 2023 at 1:03 PM Dawid Weiss <dawid.we...@gmail.com> wrote:
> >
> >
> > Hi Steve, Chris,
> >
> > Ok, makes sense. Thanks for the pointers. I agree the justification for the 
> > use of character-level normalization filters is highly context-dependent 
> > (for example, unsuitable when mixed languages are present on input).
> >
> > Dawid
> >
> > On Fri, Nov 10, 2023 at 6:58 PM Chris Hostetter <hossman_luc...@fucit.org> wrote:
> >>
> >>
> >> : Here's the unicode letter after "th":
> >> : https://www.fileformat.info/info/unicode/char/0435/index.htm
> >> :
> >> : To my surprise, I couldn't find it in the ascii folding filter:
> >> :
> >> : https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java
> >> :
> >> : Does anybody remember whether the omission of Cyrillic characters was
> >> : intentional? (There are quite a few of them that are nearly identical in
> >> : appearance to Latin letters.)
> >>
> >> From the javadocs, i'm going to guess it's because the filter focuses
> >> on "Latin_characters_in_Unicode" ... and your "CYRILLIC SMALL LETTER IE"
> >> isn't described as being a "(adjective) LATIN noun (WITH noun)" like all
> >> of the other characters that are considered to have a direct mapping to
> >> the "ASCII" / latin characters.
> >>
> >> If you look back at when it was added...
> >>
> >> https://issues.apache.org/jira/browse/LUCENE-1390
> >>
> >> ...the original focus was on deprecating "ISOLatin1AccentFilter" and
> >> replacing it with "a more comprehensive version of this code that included
> >> not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin
> >> Extended A unicode blocks."  (The originally proposed name was
> >> 'ISOLatinAccentFilter') ... subsequent discussion focused on adding more
> >> Latin blocks.
> >>
> >> There was a related issue at the time which initially aimed to add a
> >> more general "UnicodeNormalizationFilter" that ultimately resulted in
> >> adding the "ICU" analysis classes...
> >>
> >> https://issues.apache.org/jira/browse/LUCENE-1343
> >>
> >> ...which IIUC may better handle "CYRILLIC SMALL LETTER IE" (but i haven't
> >> tested that)
> >>
> >>
> >>
> >> -Hoss
> >> http://www.lucidworks.com/
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>
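
Hoss's point in the quoted message about character names can be checked
with a short Python sketch. The helpers is_named_latin and strip_accents
are hypothetical names introduced here, and the decompose-and-drop-marks
approach is a stand-in technique for accent folding in general; it is not
meant to describe how ASCIIFoldingFilter itself is implemented.

```python
import unicodedata


def is_named_latin(ch: str) -> bool:
    """True if the character's Unicode name contains the word LATIN."""
    return "LATIN" in unicodedata.name(ch, "")


def strip_accents(text: str) -> str:
    """Decompose with NFKD, then drop combining marks (a generic accent fold)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# U+00E9 is an accented Latin letter; U+0435 is the Cyrillic look-alike.
for ch in ("\u00e9", "\u0435"):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        is_named_latin(ch),
        strip_accents(ch) == "e",
    )
```

The accented Latin letter decomposes to a plain 'e' plus a combining mark,
so a decomposition-based fold reaches ASCII. The Cyrillic letter is a base
character in its own script with nothing to strip, so it stays as it is.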

