bug#79702: request: flag for visually identical but different unicode characters

arnold Sun, 26 Oct 2025 12:47:40 -0700

Isn't this what equivalence classes (like [[=e=]]) are supposed
to solve?

Can grep even use them?


Arnold

Dave via Bug reports for GNU grep <[email protected]> wrote:

> Today, I realized that there are characters which are visually
> identical yet have different Unicode code points, so grep can't match
> one against the other.
>
> Example #1:
> احمدی
>
> Example #2:
> احمدى
>
> The ى looks exactly the same in both examples, yet the first one is
> U+06CC and the second one is U+0649.
>
> From the user's perspective, it's impossible to tell which code point
> the word is using. In fact, these two words, even though they come from
> different languages/keyboards, match perfectly on the other letters,
> and it's only the ی/ى that prevents the match.
>
> While not as important, this letter has other variants like ي (notice
> the two dots below it, think of an umlaut), corresponding to U+064A. If
> you press Ctrl + F in your browser, you'll notice that you can match
> U+064A with U+0649, but this is not the default behavior in grep either.
>
> I understand there's no straightforward solution for this, so I'm
> thinking of an extra flag that converts all visually similar
> characters to the same code point and then looks for matches. Thoughts?
>
>
>
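
For illustration only, the folding idea proposed in the quoted message
could be sketched as below. This is not grep code and not an existing
grep option; the table YEH_FOLD and the helpers fold() and
folded_search() are hypothetical names, and the choice of U+06CC as the
representative code point is arbitrary. Both the pattern and each input
line are folded before matching.

```python
import re

# Hypothetical folding table (an assumption, not taken from grep or from
# any Unicode data file): map the look-alike Yeh variants named in the
# report to a single representative code point.
YEH_FOLD = str.maketrans({
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH          -> ARABIC LETTER FARSI YEH
})


def fold(text: str) -> str:
    """Replace every character in the folding table by its representative."""
    return text.translate(YEH_FOLD)


def folded_search(pattern: str, lines: list[str]) -> list[str]:
    """Return the lines matching pattern once both sides have been folded."""
    rx = re.compile(fold(pattern))
    return [line for line in lines if rx.search(fold(line))]


# The two spellings from the report, written with explicit escapes.
farsi_spelling = "\u0627\u062D\u0645\u062F\u06CC"   # ends in U+06CC
arabic_spelling = "\u0627\u062D\u0645\u062F\u0649"  # ends in U+0649

matches = folded_search(farsi_spelling, [arabic_spelling, "unrelated"])
```

A real implementation would need a much larger table, and which
characters count as "visually similar" depends on the font and the
script, so the table above should be read as a placeholder.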
