Re: fold: add the --characters option

Pádraig Brady Thu, 21 Aug 2025 05:34:14 -0700

On 21/08/2025 05:51, Collin Funk wrote:

I noticed that Fedora and OpenSUSE (likely others) add a patch to
Coreutils for 'fold --characters'. Not sure if that option has been
discussed here.


Anyways, I wrote the following patch which is much simpler than the one
they use, in my opinion. The mbfile module from Gnulib provides a
similar interface to stdio's getc. When operating on ascii characters it
does not call mbrtoc32, etc.

Here is an example of the behavior using 뉐 which has a width of two
columns:

     $ for i in $(seq 10); do printf '\uB250' >> test.txt; done
     $ printf '\n' >> test.txt
     $ cat test.txt
     뉐뉐뉐뉐뉐뉐뉐뉐뉐뉐
     $ ./src/fold -w 5 test.txt
     뉐뉐
     뉐뉐
     뉐뉐
     뉐뉐
     뉐뉐
     $ ./src/fold --characters -w 5 test.txt
     뉐뉐뉐뉐뉐
     뉐뉐뉐뉐뉐

What do you think?

It would be nice to improve unicode support and this was an easy
start. Something like 'tr' is much more difficult.


The approach looks sound.

Note that building with -O3 gave the following warning (an error under
-Werror), which should be fixed, but which I've not looked at:
  In function 'mbfile_multi_getc',
      inlined from 'fold_file' at src/fold.c:164:7:
  ./lib/mbfile.h:235:14: error: writing 1 byte into a region of size 0 [-Werror=stringop-overflow=]
    235 |           *p = *(p + bytes);
        |              ^
  ./lib/mbfile.h: In function 'fold_file':
  ./lib/mbfile.h:82:8: note: at offset [36, 4294967268] into destination object 'buf' of size 4
     82 |   char buf[MBCHAR_BUF_SIZE];
        |        ^

Size hasn't significantly increased (about 4.5KB, or 13%):
  $ size src/fold
     text          data     bss     dec     hex filename
    33503          1020     512   35035    88db src/fold
  $ size src/fold-c
     text          data     bss     dec     hex filename
    37971          1060     504   39535    9a6f src/fold-c

BTW libunistring is available as a shared library, and whether to link
to it or not is a tradeoff between startup speed and the size increase
from linked tables etc. I've not looked into the details here, but
avoiding linking printf(1) with libunistring was previously discussed at:
https://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00003.html

Re performance, it's good, but it would be great if
we could maintain the LC_ALL=C performance like the i18n patch does.
Some quick testing shows:

  $ yes `seq 100` | head -n 1M > file.in

  # Note /bin/fold has the (Fedora) i18n patch applied
  $ for L in en_US.UTF-8 C; do
      for FOLD in src/fold src/fold-c /bin/fold; do
        printf "LC_ALL=$L $FOLD: "
        time LC_ALL=$L $FOLD < file.in | wc -l
      done
    done

  LC_ALL=en_US.UTF-8 src/fold: 4194304
  real  0m1.046s
  LC_ALL=en_US.UTF-8 src/fold-c: 4194304
  real  0m8.294s
  LC_ALL=en_US.UTF-8 /bin/fold: 4194304
  real  0m11.556s
  LC_ALL=C src/fold: 4194304
  real  0m0.979s
  LC_ALL=C src/fold-c: 4194304
  real  0m8.277s
  LC_ALL=C /bin/fold: 4194304
  real  0m0.976s

I.e. we beat the i18n patch implementation in the UTF-8 locale
(8.3s vs 11.6s), but we don't shortcut the LC_ALL=C case,
where fold-c still takes 8.3s vs about 1s for the other two.

Re the test, I adjusted it to gate on $LOCALE_FR_UTF8,
which is the standard multi-byte locale checked for by gnulib.
I've also made some other tweaks (attached).
Since this code isn't UTF-8 specific, it would be good to test
other charsets/characters I think, using the `locale charmap`
trick to see if they're available. My Fedora 42 system has the
following candidates, for example:

$ locale -a | cut -s -d. -f2 | sort -u | grep -Eiv '(utf8|iso|8|cp)'
big5
big5hkscs
euc
eucjp
euckr
euctw
gb2312
gbk
georgianps
pt154
tis620
ujis
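
For reference, the `locale charmap` check mentioned above could look
roughly like the following. This is only a sketch: have_charmap is a
hypothetical helper name (it is not in the attached patch), and the
locale/charmap pairs are assumed examples that will differ per system.
It relies on an unavailable locale typically not reporting the
requested charmap (on glibc it falls back to the C locale's).

```shell
# Hypothetical helper (not from the attached patch): succeed only if
# locale $1 is installed and its charmap is exactly $2.
have_charmap ()
{
  test "$(LC_ALL=$1 locale charmap 2>/dev/null)" = "$2"
}

# Locale and charmap names below are assumed examples; adjust per system.
for pair in ja_JP.eucJP:EUC-JP zh_TW.big5:BIG5 ko_KR.euckr:EUC-KR; do
  loc=${pair%%:*} map=${pair#*:}
  if have_charmap "$loc" "$map"; then
    echo "$loc: available, can be tested"
  else
    echo "$loc: not available, skip"
  fi
done
```

A test would then skip, rather than fail, for each charset that the
system doesn't provide.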


BTW I've some general notes on i18n in coreutils at:
https://www.pixelbeat.org/docs/coreutils_i18n/

Thanks for working on this!
Padraig
