On 21/08/2025 05:51, Collin Funk wrote:
I noticed that Fedora and OpenSUSE (likely others) add a patch to
Coreutils for 'fold --characters'. Not sure if that option has been
discussed here.
Anyways, I wrote the following patch which is much simpler than the one
they use, in my opinion. The mbfile module from Gnulib provides a
similar interface to stdio's getc. When operating on ASCII characters it
does not call mbrtoc32, etc.
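To illustrate the ASCII shortcut described above, here is a rough sketch.
It uses the standard mbrtowc on an in-memory buffer rather than Gnulib's
mbfile or mbrtoc32, so it is not the patch's code; next_char_len and
count_chars are hypothetical names, and it assumes an ASCII-compatible
locale encoding.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not the Gnulib mbfile API: return the number of
   bytes occupied by the character at the start of buf[0..len), len > 0.  */
static size_t
next_char_len (const char *buf, size_t len, mbstate_t *state)
{
  unsigned char c = (unsigned char) buf[0];

  /* Fast path: an ASCII byte in the initial shift state needs no decoding.
     Assumption: the locale encoding is ASCII compatible.  */
  if (c < 0x80 && mbsinit (state))
    return 1;

  size_t n = mbrtowc (NULL, buf, len, state);
  if (n == (size_t) -1 || n == (size_t) -2)
    {
      /* Invalid or truncated sequence: reset the state, consume one byte.  */
      memset (state, 0, sizeof *state);
      return 1;
    }
  return n == 0 ? 1 : n;        /* mbrtowc returns 0 for the NUL character */
}

/* Count characters (not bytes) in s[0..len) under the current locale.  */
size_t
count_chars (const char *s, size_t len)
{
  mbstate_t state;
  memset (&state, 0, sizeof state);

  size_t count = 0;
  for (size_t i = 0; i < len; count++)
    i += next_char_len (s + i, len - i, &state);
  return count;
}
```

A caller would select the locale first, e.g. with setlocale (LC_ALL, ""),
so that mbrtowc decodes according to the user's charset.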
Here is an example of the behavior using 뉐 which has a width of two
columns:
$ for i in $(seq 10); do printf '\uB250' >> test.txt; done
$ printf '\n' >> test.txt
$ cat test.txt
뉐뉐뉐뉐뉐뉐뉐뉐뉐뉐
$ ./src/fold -w 5 test.txt
뉐뉐
뉐뉐
뉐뉐
뉐뉐
뉐뉐
$ ./src/fold --characters -w 5 test.txt
뉐뉐뉐뉐뉐
뉐뉐뉐뉐뉐
What do you think?
It would be nice to improve Unicode support and this was an easy
start. Something like 'tr' is much more difficult.
The approach looks sound.
Note that building with -O3 gave the following warning (promoted to an
error by -Werror), which should be fixed but which I've not yet looked at:
In function 'mbfile_multi_getc',
    inlined from 'fold_file' at src/fold.c:164:7:
./lib/mbfile.h:235:14: error: writing 1 byte into a region of size 0 [-Werror=stringop-overflow=]
  235 |           *p = *(p + bytes);
      |              ^
./lib/mbfile.h: In function 'fold_file':
./lib/mbfile.h:82:8: note: at offset [36, 4294967268] into destination object 'buf' of size 4
   82 |   char buf[MBCHAR_BUF_SIZE];
      |        ^
Size hasn't significantly increased (text grows by about 13%):
$ size src/fold
   text    data     bss     dec     hex filename
  33503    1020     512   35035    88db src/fold
$ size src/fold-c
   text    data     bss     dec     hex filename
  37971    1060     504   39535    9a6f src/fold-c
BTW libunistring is available as a shared library,
and whether to link to it or not is a tradeoff between startup speed
and the size increase from linked-in tables etc. I've not looked into the
details here, but this was previously discussed in the context of avoiding
linking printf(1) with libunistring:
https://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00003.html
Re performance, it's good, but it would be great if
we could maintain the LC_ALL=C performance like the i18n patch does.
Some quick testing shows:
$ yes `seq 100` | head -n 1M > file.in
# Note /bin/fold has the (Fedora) i18n patch applied
$ for L in en_US.UTF-8 C; do
    for FOLD in src/fold src/fold-c /bin/fold; do
      printf 'LC_ALL=%s %s: ' "$L" "$FOLD"
      time LC_ALL=$L $FOLD < file.in | wc -l
    done
  done
LC_ALL=en_US.UTF-8 src/fold: 4194304
real 0m1.046s
LC_ALL=en_US.UTF-8 src/fold-c: 4194304
real 0m8.294s
LC_ALL=en_US.UTF-8 /bin/fold: 4194304
real 0m11.556s
LC_ALL=C src/fold: 4194304
real 0m0.979s
LC_ALL=C src/fold-c: 4194304
real 0m8.277s
LC_ALL=C /bin/fold: 4194304
real 0m0.976s
I.e. with the patch (src/fold-c) we beat the i18n patch implementation
in the UTF-8 locale (8.3s vs 11.6s),
but we don't shortcut the LC_ALL=C case (8.3s vs 1.0s).
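One way to shortcut the LC_ALL=C case would be to choose the code path
once at startup. A minimal sketch, assuming a byte-oriented loop is kept
alongside the multibyte one; fold_mode and choose_fold_mode are
hypothetical names, not anything currently in fold.c:

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Hypothetical names for the two code paths.  */
typedef enum { FOLD_BYTES, FOLD_MULTIBYTE } fold_mode;

/* When the locale's longest character is one byte, a plain getc-style
   loop sees exactly the same characters as the multibyte reader, so the
   cheaper byte path can be used.  Call after setlocale (LC_ALL, "").  */
fold_mode
choose_fold_mode (void)
{
  return MB_CUR_MAX == 1 ? FOLD_BYTES : FOLD_MULTIBYTE;
}
```

This check costs nothing per character, so it should keep single-byte
locales close to the unpatched timing, though I've not measured that.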
Re the test, I adjusted it to gate on $LOCALE_FR_UTF8,
which is the standard multi-byte locale checked for by gnulib.
I've also made some other tweaks (attached).
Since this code isn't UTF-8 specific, it would be good to test
other charsets/characters I think, using the `locale charmap`
trick to see if they're available. My Fedora 42 system has the
following potential candidates, for example:
$ locale -a | cut -s -d. -f2 | sort -u | grep -Eiv '(utf8|iso|8|cp)'
big5
big5hkscs
euc
eucjp
euckr
euctw
gb2312
gbk
georgianps
pt154
tis620
ujis
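For reference, the same availability check can be expressed in C, since
`locale charmap` prints what nl_langinfo (CODESET) returns. This is only
a sketch; locale_has_charmap is a hypothetical helper, and the test suite
would more naturally do the check in shell:

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if locale NAME can be selected and its
   charmap is CHARMAP.  Note it leaves the process in the "C" locale.  */
bool
locale_has_charmap (const char *name, const char *charmap)
{
  if (!setlocale (LC_ALL, name))
    return false;               /* locale not installed */

  bool ok = strcmp (nl_langinfo (CODESET), charmap) == 0;
  setlocale (LC_ALL, "C");
  return ok;
}
```

For example, locale_has_charmap ("ja_JP.eucjp", "EUC-JP") would presumably
be true on a system with that locale installed; the exact locale and
charmap names are assumptions here and vary between systems.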
BTW I've some general notes on i18n in coreutils at:
https://www.pixelbeat.org/docs/coreutils_i18n/
Thanks for working on this!
Padraig