Pádraig Brady <p...@draigbrady.com> writes:

> On 25/08/2025 06:47, Collin Funk wrote:
>> I noticed that mcel does not see the following characters as equal in a
>> UTF-8 locale:
>>     è (U+0065 + U+0300)
>>     è (U+00E8)
>> This is because mcel_isbasic (U+0065) sees an ASCII character and does
>> not normalize it using the following U+0300.
>> Is this intentional or not?
>> I had a look at implementing multibyte 'uniq --ignore-case' and it is
>> fairly easy. If we assume normalized Unicode we can even keep it
>> optimized in the UTF-8 case by using memcasecmp and memcmp:
>> static bool
>> different (char *old, char *new, idx_t oldlen, idx_t newlen)
>> {
>>    if (1 < MB_CUR_MAX && ignore_case)
>>      {
>>        /* Scan using mcel and c32tolower.  */
>>        return result;
>>      }
>>    if (ignore_case)
>>      return oldlen != newlen || memcasecmp (old, new, oldlen);
>>    else
>>      return oldlen != newlen || memcmp (old, new, oldlen);
>> }
>
> Yes, this is the first question posed at:
> https://www.pixelbeat.org/docs/coreutils_i18n/
> Whatever we decide, we should be consistent across all utils.

Oops, I clearly skipped the "planning" part.

> I'm inclined to leave normalization to external tools like iconv and uconv.

That certainly makes life easier. Also, I imagine that normalizing all
input cannot mean good things for performance.

Based on Bruno's response on bug-gnulib, I think your idea is the most
reasonable [1].

I do like the idea of adding Assaf Gordon's 'unorm' program, so that one
can normalize Unicode for other programs to use [2][3]. In other words,
without needing to install 'uconv'.

Not sure how common that is, but anecdotally I did not have 'uconv' on
my system until yesterday. I certainly had programs that depended on
libicu though.

> Note this is also related to how we deal with invalid encodings.
> In that regard I'm inclined that we should fall back to unibyte
> interpretation of invalid multi-byte chars internally.

Yep, my instincts say that replacing it with � (U+FFFD), as the Unicode
standard recommends, is correct. But upon second thought it seems wrong
to silently alter the input data.

> How the existing i18n patch deals with this matters too,
> since we want to avoid changes / regressions wrt that.

Sure.

Collin

[1] https://lists.gnu.org/archive/html/bug-gnulib/2025-08/msg00090.html
[2] https://lists.gnu.org/archive/html/coreutils/2017-10/msg00032.html
[3] https://web.archive.org/web/20190921221842/https://files.housegordon.org/src/coreutils-multibyte-experimental-8.28.39-79242.tar.xz
