Re: coreutils i18n

Collin Funk Sat, 23 Aug 2025 18:56:01 -0700

Hi Bruno,

Bruno Haible via GNU coreutils General Discussion <coreutils@gnu.org>
writes:


> Hi Pádraig,
>
> You wrote in 
> https://lists.gnu.org/archive/html/coreutils/2025-08/msg00032.html:
>> BTW I've some general notes on i18n in coreutils at:
>> https://www.pixelbeat.org/docs/coreutils_i18n/
>
> An interesting read. Please allow me three remarks:

Not Pádraig, but I can give some thoughts.

> * Regarding the history.
>   There you write: "nothing was completed due to the size of the work
>   involved.".
>
>   No, that's not how I recall it.
>   - When only the display width of a string in multibyte locales was the
>     issue, support was added by #include "mbswidth.h" from Gnulib.
>   - For tools that process characters in a non-trivial loop, indeed, nothing
>     was completed. As I recall, it was because Jim did not agree with any of
>     the three approaches that I proposed.
>
>     One of the approaches was to write code like
>
>       if (MB_CUR_MAX > 1)
>         {
>           ...code for multibyte locales...
>         }
>       else
>         {
>           ...code for unibyte locales...
>         }
>
>     Jim did not like this one because it duplicates the logic. (And right he
>     is. I like to say that code duplication is a professional mistake.)

I agree with you and Jim. I think that situation is best avoided, as it
can cause problems in many ways. For example, one makes changes in the
multibyte branch but forgets to make them in the unibyte one. Ideally
code review would catch this, but people get busy.

However, it will likely be needed in some places for performance.

>     Another approach that I proposed was to write code with the mbchar.h
>     module from Gnulib. This does not duplicate the logic, but it came with a
>     performance penalty for the unibyte locales; Jim rejected it for this
>     reason. At that time, most of the locales were unibyte locales. Still
>     today, the "C" locale is unibyte and is used in many places. Therefore
>     this argument is still valid today.

My original fold patch used the mbfile and mbchar modules. I thought
they were nice to use, and they support every encoding supported by
mbrtoc32/mbrtowc which is great. However it made the program much
slower, even when LC_ALL=C was used [1]. Using getline (...) and mcel is
much faster but does not support as many encodings [2].

Using mbchar makes sense for GNU Bison, which I learned uses it, since
Bison is unlikely to ever be run on massive files. But for Coreutils I
imagine slowing down 'LC_ALL=C sort', for example, will cause quite a
few angry messages to the mailing list. :)

Also, I doubt most will care about obsolete encodings not supported by
mcel (minus some z/OS people who use EBCDIC). 90-something percent of
the web is UTF-8, and I doubt local files are much different.

> * You write: "Note wchar_t is only 16 bits on windows"
>   The wchar_t problem has been solved through the char32_t type, which is well
>   supported in Gnulib now, see
>   https://www.gnu.org/software/gnulib/manual/html_node/Comparison-of-character-APIs.html

With the minor exception of the regex functions which you are still
working on, IIRC. But that affects grep much more than Coreutils.

I think Pádraig knows about that, just hasn't updated the page in a
while. He added 'dd' after I sent him the following example:

    $ echo abc > input1.txt
    $ dd if=input1.txt conv=ucase status=none
    ABC
    $ echo 'привітав' > input2.txt
    $ dd if=input2.txt conv=ucase status=none
    привітав

One would expect the following:

    $ python3 -c 'print("привітав".upper())'
    ПРИВІТАВ
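
The difference between the two results can be sketched in Python. This
is only an illustration, assuming that 'dd conv=ucase' maps each byte
independently (which the output above suggests); the function names
upper_bytewise and upper_decoded are made up for this example and are
not taken from dd.

```python
def upper_bytewise(data: bytes) -> bytes:
    """Uppercase ASCII a-z one byte at a time; every other byte passes through.

    Assumption: this mimics a byte-oriented case conversion. In UTF-8,
    all bytes of a non-ASCII character are >= 0x80, so they are never
    touched.
    """
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in data)


def upper_decoded(data: bytes, encoding: str = "utf-8") -> bytes:
    """Decode to characters first, uppercase them, then encode again."""
    return data.decode(encoding).upper().encode(encoding)
```

With these definitions, upper_bytewise(b"abc") gives b"ABC", while
upper_bytewise applied to the UTF-8 bytes of 'привітав' returns them
unchanged. upper_decoded on the same bytes yields the encoding of
'ПРИВІТАВ', matching the python3 one-liner above.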

Also, another one I noticed. If you use a username with multibyte
characters (which seems like a bad idea to me, but I suppose nothing
stops you from using one) pinky doesn't behave correctly. See the
following example:

    $ grep -F test-user /etc/passwd
    test-user:x:1001:1001:ab&cd:/usr/share/empty:/bin/bash
    $ pinky -l test-user
    Login name: test-user                   In real life:  abTest-usercd
    Directory: /usr/share/empty             Shell:  /bin/bash

The first letter of the username can only be capitalized if it is ASCII:

    $ pinky -l átest-user
    Login name: átest-user                 In real life:  abátest-usercd
    Directory: /usr/share/empty             Shell:  /bin/bash

One would expect 'In real life:  abÁtest-usercd'. Also the alignment
doesn't account for character widths.
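
A rough Python sketch of the expected behavior, not pinky's actual code:
expand_gecos, display_width and pad_to_columns are hypothetical helpers,
and the width rule (East Asian wide/fullwidth characters take two
columns, combining marks take none) is a simplification of what
wcwidth-style functions do.

```python
import unicodedata


def expand_gecos(gecos: str, login: str) -> str:
    """Replace each '&' with the login name, first character capitalized.

    Working on characters rather than bytes lets a non-ASCII first
    letter be capitalized too.
    """
    capitalized = login[:1].upper() + login[1:]
    return gecos.replace("&", capitalized)


def display_width(text: str) -> int:
    """Approximate the number of terminal columns that text occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks add no column of their own
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def pad_to_columns(text: str, columns: int) -> str:
    """Pad with spaces based on display width, not byte or character count."""
    return text + " " * max(0, columns - display_width(text))
```

Here expand_gecos("ab&cd", "átest-user") returns 'abÁtest-usercd'. The
login name 'átest-user' is 11 bytes in UTF-8 but only 10 columns wide,
which would account for the "In real life:" column sitting one position
to the left in the second pinky output above.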

> * Beyond what is multibyte functionality specified by POSIX, the feature I
>   would love most to see in coreutils is for 'fold' to support line breaking
>   according to the Unicode line breaking algorithm. This would make 'fold'
>   useful e.g. in Chinese, where spaces are not used to separate words.
>   This would imply adding an option
>     fold --unicode
>   and making use of the Gnulib module 'unilbrk/ulc-width-linebreaks' or
>   'unilbrk/ulc-possible-linebreaks'.

I think this is a good idea, thanks. Maybe others will take issue with
linking to the large tables, though.

Collin

[1] https://lists.gnu.org/archive/html/coreutils/2025-08/msg00032.html
[2] https://lists.gnu.org/archive/html/coreutils/2025-08/msg00036.html
