Re: coreutils i18n

Pádraig Brady Sun, 24 Aug 2025 04:36:18 -0700

On 24/08/2025 01:06, Bruno Haible via GNU coreutils General Discussion wrote:

Hi Pádraig,


You wrote in https://lists.gnu.org/archive/html/coreutils/2025-08/msg00032.html:

BTW I've some general notes on i18n in coreutils at:
https://www.pixelbeat.org/docs/coreutils_i18n/


An interesting read. Please allow me three remarks:


Note I wrote this 10 years ago when I presented it to Red Hat management
to try and get 3 months of my time dedicated to improving the situation.
Unfortunately Red Hat priorities were elsewhere.


* Regarding the history.
   There you write: "nothing was completed due to the size of the work
   involved.".

   No, that's not how I recall it.
   - When only the display width of a string in multibyte locales was the issue,
     support was added by #include "mbswidth.h" from Gnulib.


Thanks for the clarification. I've adjusted the page.

   - For tools that process characters in a non-trivial loop, indeed, nothing
     was completed. As I recall, it was because Jim did not agree with any of
     the three approaches that I proposed.

     One of the approaches was to write code like

       if (MB_CUR_MAX > 1)
         {
           ...code for multibyte locales...
         }
       else
         {
           ...code for unibyte locales...
         }

     Jim did not like this one because it duplicates the logic. (And right he
     is. I like to say that code duplication is a professional mistake.)
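
     To make the duplication concrete, here is a minimal sketch of the
     MB_CUR_MAX pattern using only standard C. The helper name count_chars
     is hypothetical and does not come from coreutils; the point is just
     that the "count one character" logic appears twice, once per branch.

```c
#include <assert.h>
#include <stdlib.h>   /* MB_CUR_MAX */
#include <string.h>   /* strlen, memset */
#include <wchar.h>    /* mbrlen, mbstate_t */

/* Count characters (not bytes) in S under the current locale.
   Hypothetical helper for illustration; not taken from coreutils.  */
static size_t
count_chars (const char *s)
{
  assert (s != NULL);

  if (MB_CUR_MAX > 1)
    {
      /* Multibyte locale: decode one character at a time.  */
      mbstate_t state;
      size_t remaining = strlen (s);
      size_t count = 0;

      memset (&state, 0, sizeof state);
      while (remaining > 0)
        {
          size_t len = mbrlen (s, remaining, &state);

          if (len == (size_t) -1 || len == (size_t) -2 || len == 0)
            {
              /* Invalid or incomplete sequence (or an unexpected NUL):
                 treat the byte as one character and reset the state.  */
              len = 1;
              memset (&state, 0, sizeof state);
            }
          s += len;
          remaining -= len;
          count++;
        }
      return count;
    }
  else
    {
      /* Unibyte locale: every byte is one character.  */
      return strlen (s);
    }
}
```

     In the default "C" locale, count_chars ("hello") takes the unibyte
     branch. After setlocale (LC_ALL, "") in a UTF-8 locale, the same call
     goes through the mbrlen loop instead.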

     Another approach that I proposed was to write code with the mbchar.h module
     from Gnulib. This does not duplicate the logic, but it came with a
     performance penalty for the unibyte locales; Jim rejected it for this
     reason. At that time, most of the locales were unibyte locales. Still
     today, the "C" locale is unibyte and is used in many places. Therefore
     this argument is still valid today.

     Another approach that I proposed was the one used by the 'fnmatch' module
     in Gnulib: Move out the core loop to a separate file, and parameterize this
     file so that it can be used in two modes: for the unibyte case, working on
     types such as 'char', and for the multibyte case, working on types such as
     'wchar_t'. (Nowadays that should be 'char32_t', not 'wchar_t'.) Jim
     rejected this approach as well. (Or maybe Paul did? I don't remember in
     detail.)
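
     A rough single-file sketch of the parameterized-loop idea follows. It
     is not the actual 'fnmatch' code: Gnulib puts the loop in a separate
     file that is included twice, whereas this stand-in uses a macro so the
     example stays self-contained. All names here (DEFINE_COUNT_WORDS,
     count_words_unibyte, count_words_wide) are hypothetical, and 'wchar_t'
     is used only to stay within standard headers; per the note above,
     'char32_t' would be the better choice today.

```c
#include <assert.h>
#include <ctype.h>    /* isspace */
#include <stddef.h>   /* size_t */
#include <wchar.h>    /* wchar_t, wint_t */
#include <wctype.h>   /* iswspace */

/* The core loop is written once, parameterized by the function name,
   the character type, and the whitespace predicate.  */
#define DEFINE_COUNT_WORDS(NAME, CHAR_T, IS_SPACE)      \
  static size_t                                         \
  NAME (const CHAR_T *s, size_t n)                      \
  {                                                     \
    size_t words = 0;                                   \
    int in_word = 0;                                    \
    assert (s != NULL || n == 0);                       \
    for (size_t i = 0; i < n; i++)                      \
      {                                                 \
        if (IS_SPACE (s[i]))                            \
          in_word = 0;                                  \
        else if (!in_word)                              \
          {                                             \
            in_word = 1;                                \
            words++;                                    \
          }                                             \
      }                                                 \
    return words;                                       \
  }

/* Type-specific predicates plugged into the template.  */
static int
is_space_char (char c)
{
  return isspace ((unsigned char) c) != 0;
}

static int
is_space_wchar (wchar_t c)
{
  return iswspace ((wint_t) c) != 0;
}

/* Instantiate the loop twice: unibyte mode and wide-character mode.  */
DEFINE_COUNT_WORDS (count_words_unibyte, char, is_space_char)
DEFINE_COUNT_WORDS (count_words_wide, wchar_t, is_space_wchar)
```

     The logic exists in one place, and the unibyte instance pays no
     decoding cost. The price is that the loop body lives inside a macro
     (or a re-included file), which is harder to read and to debug.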

     Then I didn't see any other options (given that C does not have generics
     like other programming languages), and gave up.

   What is your position regarding these three approaches today? Or do you see
   another approach, in order to avoid code duplication while keeping the code
   maintainable?


For simple loops I think it's feasible to duplicate.
For more involved logic it's best to not duplicate.

* You write: "Note wchar_t is only 16 bits on windows"
   The wchar_t problem has been solved through the char32_t type, which is well
   supported in Gnulib now, see
   https://www.gnu.org/software/gnulib/manual/html_node/Comparison-of-character-APIs.html

* Beyond the multibyte functionality specified by POSIX, the feature I
   would love most to see in coreutils is for 'fold' to support line breaking
   according to the Unicode line breaking algorithm. This would make 'fold'
   useful e.g. in Chinese, where spaces are not used to separate words.
   This would imply adding an option
     fold --unicode
   and making use of the Gnulib module 'unilbrk/ulc-width-linebreaks' or
   'unilbrk/ulc-possible-linebreaks'.


I've updated the page with the above info.

thanks!
Padraig
