[email protected]

 

   ----- Forwarded Message -----
   From: David G. Pickett <[email protected]>
   To: Dave <[email protected]>
   Sent: Sunday, October 26, 2025 at 06:07:02 PM EDT
   Subject: Re: bug#79702: request: flag for visually identical but different unicode characters

  Even before hackers were using Cyrillic/Roman lookalikes for fake URLs
(e.g., chase.com with a Cyrillic a), I recall Sybase string indexes being
insensitive both to case and to Nordic diacritics in ISO-8859-1, such as
'A' versus 'Ä' (A with an umlaut), so this is not a new idea!  I am not
sure of its utility in practical terms.  Who gets to identify the
look-alikes?


    On Sunday, October 26, 2025 at 09:54:42 AM EDT, Dave via Bug reports for 
GNU grep <[email protected]> wrote:   

 Today, I realized that there are characters which are visually
identical, yet have different Unicode code points, so grep cannot
match one against the other.

Example #1:
احمدی

Example #2:
احمدى

The final letter looks exactly the same in both examples, yet in the
first one it is U+06CC, and in the second one U+0649.
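
To make the difference visible, here is a minimal Python sketch that
lists the code points of the two example words and finds where they
diverge. The helper names (describe, first_difference) are made up for
this illustration; only the standard unicodedata module is used.

```python
import unicodedata

# The two example words, written with escapes so the difference is explicit.
word_farsi = "\u0627\u062d\u0645\u062f\u06cc"   # Example #1, ends in U+06CC
word_arabic = "\u0627\u062d\u0645\u062f\u0649"  # Example #2, ends in U+0649


def describe(text):
    """Return a (code point, Unicode name) pair for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


def first_difference(a, b):
    """Return the index of the first differing character, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


if __name__ == "__main__":
    i = first_difference(word_farsi, word_arabic)
    print(i, describe(word_farsi)[i], describe(word_arabic)[i])
```

The two strings agree on the first four letters and differ only at the
last position, which is why a byte-for-byte search fails.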

From the user's perspective, it's impossible to tell which code point
the word is using. In fact, these two words, even though they come from
different languages/keyboards, match perfectly on the other letters,
and it is only the ی/ى that escapes the match.

While not as important, this letter has other variants such as ي (notice
the two dots below it, like an umlaut), which is U+064A. If you press
Ctrl + F in your browser, you'll notice that U+064A matches U+0649, but
this is not the default behavior in grep either.

I understand there's no straightforward solution for this, so I'm
thinking of an extra flag which maps all visually similar characters
to a single code point and then looks for matches. Thoughts?
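
As a rough sketch of the idea (not an existing grep option), the
folding could look like the following in Python. The FOLD table is a
hypothetical, hand-picked mapping covering only the three letters
mentioned in this message; a real table would have to come from an
agreed-upon source. The function names fold and grep_folded are also
made up for this example, and matching is fixed-string only.

```python
# Hypothetical fold table: map the look-alike yeh variants from this
# message onto one representative code point (U+064A chosen arbitrarily).
FOLD = str.maketrans({
    "\u06cc": "\u064a",  # ARABIC LETTER FARSI YEH
    "\u0649": "\u064a",  # ARABIC LETTER ALEF MAKSURA
})


def fold(text):
    """Replace each look-alike character with its representative."""
    return text.translate(FOLD)


def grep_folded(needle, lines):
    """Return the lines containing needle once both sides are folded."""
    folded_needle = fold(needle)
    return [line for line in lines if folded_needle in fold(line)]


if __name__ == "__main__":
    haystack = [
        "\u0627\u062d\u0645\u062f\u0649",  # Example #2 (U+0649)
        "unrelated line",
    ]
    query = "\u0627\u062d\u0645\u062f\u06cc"  # Example #1 (U+06CC)
    for line in grep_folded(query, haystack):
        print(line)
```

Folding both the pattern and the input keeps the original lines intact
for output, so only the comparison is affected.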



    
