Pádraig Brady <p...@draigbrady.com> writes:

> On 25/08/2025 06:47, Collin Funk wrote:
>> I noticed that mcel does not see the following characters as equal in a
>> UTF-8 locale:
>>
>>     è (U+0065 + U+0300)
>>     è (U+00E8)
>>
>> This is because mcel_isbasic (U+0065) sees an ASCII character and does
>> not normalize it using the following U+0300.
>>
>> Is this intentional or not?
>>
>> I had a look at implementing multibyte 'uniq --ignore-case' and it is
>> fairly easy. If we assume normalized Unicode we can even keep it
>> optimized in the UTF-8 case by using memcasecmp and memcmp:
>>
>>     static bool
>>     different (char *old, char *new, idx_t oldlen, idx_t newlen)
>>     {
>>       if (1 < MB_CUR_MAX && ignore_case)
>>         {
>>           /* Scan using mcel and c32tolower.  */
>>           return result;
>>         }
>>       if (ignore_case)
>>         return oldlen != newlen || memcasecmp (old, new, oldlen);
>>       else
>>         return oldlen != newlen || memcmp (old, new, oldlen);
>>     }
>
> Yes this is the first question posed
> at: https://www.pixelbeat.org/docs/coreutils_i18n/
> Whatever we decide we should be consistent across all utils.

Oops, I clearly skipped the "planning" part.

> I'm inclined to leave normalization to external tools like iconv and uconv.

That certainly makes life easier. Also, I imagine that normalizing all
input would hurt performance. Based on Bruno's response on bug-gnulib,
I think your idea is the most reasonable [1].

I do like the idea of adding Assaf Gordon's 'unorm' program, so that one
can normalize Unicode for other programs to use [2][3]. In other words,
without needing to install 'uconv'. I am not sure how commonly 'uconv' is
installed, but anecdotally I did not have it on my system until
yesterday. I certainly had programs that depended on libicu though.

> Note this is also related to how we deal with invalid encodings.
> In that regard I'm inclined that we should fall back to unibyte
> interpretation of invalid multi-byte chars internally.

Yep, my instincts say that replacing it with � (U+FFFD) like the Unicode
standard recommends is correct. But upon second thought it seems wrong
to modify input data.

> How the existing i18n patch deals with this matters too,
> since we want to avoid changes / regressions wrt that.

Sure.

Collin

[1] https://lists.gnu.org/archive/html/bug-gnulib/2025-08/msg00090.html
[2] https://lists.gnu.org/archive/html/coreutils/2017-10/msg00032.html
[3] https://web.archive.org/web/20190921221842/https://files.housegordon.org/src/coreutils-multibyte-experimental-8.28.39-79242.tar.xz
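
P.S. To make the normalization point concrete, here is a minimal sketch
of why a byte-wise comparison treats the two spellings of "è" as
different lines when the input is not normalized. It only mirrors the
memcmp branch of the function quoted above; the array names are
hypothetical, size_t stands in for idx_t, and this is not the actual
uniq code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Byte-wise comparison, modelled on the non-ignore-case branch of the
   'different' sketch quoted earlier in the thread.  */
static bool
different (char const *old, char const *new, size_t oldlen, size_t newlen)
{
  return oldlen != newlen || memcmp (old, new, oldlen) != 0;
}

/* UTF-8 encodings of the two spellings of "è" (names are illustrative).  */
static char const decomposed[] = "e\xcc\x80";   /* U+0065 U+0300, 3 bytes */
static char const precomposed[] = "\xc3\xa8";   /* U+00E8, 2 bytes */
```

Calling different (decomposed, precomposed, 3, 2) returns true, since the
byte sequences differ in both length and content. If the input were first
normalized by an external tool (to NFC, say), both lines would carry the
same bytes and the fast memcmp path would treat them as equal.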