On 23/08/2025 06:28, Collin Funk wrote:
Pádraig Brady <p...@draigbrady.com> writes:
That is better. I agree the focus on UTF-8 is prudent.
Given the tradeoffs here, this seems like the best approach.
This is 4x faster (than the i18n patch) in the normal case,
and 2x slower in the LC_ALL=C case.
Fold being focused on text, and usually reasonable amounts of text,
this is a good tradeoff.
Lower level (set) operations like sort, uniq, join
would have more important perf constraints.
Yes, agreed. Hopefully I will come up with some Gnulib modules to handle
reading from files and handling strings without doing:
    if (ascii)
      fast_case (...)
    else
      slow_case (...)
Since at that point we are implementing the program twice.
Right, that's best avoided, though for simple programs like uniq
it wouldn't be too onerous I think.
I'd tweak NEWS to spell out "count multi-byte characters"
rather than just "count characters".
Sure, done. I struggled a bit to word NEWS and the documentation since
fold can operate on characters, bytes, or columns. The 小 Unicode
character, for example, is 1 character, 2 columns (based on
EastAsianWidth.txt [1]), and 3 bytes when UTF-8 encoded. But explaining
all of that feels too technical for the Coreutils manual.
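
To make the byte/character distinction above concrete, here is a minimal sketch in C. It is not code from the patch: the names count_utf8 and text_counts are made up for illustration, and it assumes the input is valid UTF-8. Columns are left out on purpose, since they would need something like wcwidth and a suitable locale.

```c
#include <stddef.h>

/* Hypothetical helper type, not part of coreutils or Gnulib.  */
struct text_counts
{
  size_t bytes;   /* number of bytes before the terminating NUL */
  size_t chars;   /* number of UTF-8 encoded characters */
};

/* Count bytes and characters in S, assuming S is valid UTF-8.
   Every byte of the form 10xxxxxx is a continuation byte, so only
   the other bytes start a new character.  */
static struct text_counts
count_utf8 (const char *s)
{
  struct text_counts c = { 0, 0 };
  for (; *s; s++)
    {
      c.bytes++;
      if (((unsigned char) *s & 0xC0) != 0x80)
        c.chars++;
    }
  return c;
}
```

With the UTF-8 encoding of 小 ("\xE5\xB0\x8F"), this reports 3 bytes and 1 character.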
Since we're changing --spaces handling too,
it would be good to incorporate spaces mentioned
in tests/wc/wc-nbsp.sh in fold-characters.sh
Good point. I added fold-spaces.sh for testing some breaking Unicode
space characters. And fold-nbsp.sh to test some Unicode non-breaking
space characters.
This actually caught a bug in my v2 patch. When using 'fold --spaces' it
improperly assumed the space was a single byte. This isn't the case
using U+2002 EN SPACE or various other characters.
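
As a rough illustration of why the single-byte assumption fails, the sketch below derives the length of a UTF-8 sequence from its lead byte. This is not how the pushed patch does it (the real code would more likely rely on the multi-byte conversion functions); utf8_seq_len is a hypothetical name, and treating an invalid lead byte as one byte is just one possible policy.

```c
#include <stddef.h>

/* Hypothetical helper: return the number of bytes in the UTF-8
   sequence that starts with LEAD.  Invalid lead bytes (including
   stray continuation bytes) are treated as a single byte.  */
static size_t
utf8_seq_len (unsigned char lead)
{
  if (lead < 0x80)
    return 1;                   /* 0xxxxxxx: ASCII */
  if ((lead & 0xE0) == 0xC0)
    return 2;                   /* 110xxxxx */
  if ((lead & 0xF0) == 0xE0)
    return 3;                   /* 1110xxxx */
  if ((lead & 0xF8) == 0xF0)
    return 4;                   /* 11110xxx */
  return 1;                     /* invalid lead byte */
}
```

U+2002 EN SPACE is encoded as E2 80 82, so utf8_seq_len (0xE2) gives 3, while an ASCII space gives 1.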
Anyways, I pushed the attached patch.
Excellent work on this.
On the subject of testing, this is one place where the i18n patch lacked,
though it does have some adjustments for testing fold.
It would be good to incorporate its tests/.../fold.pl adjustments.
Also it would be good to add tests for invalid multi-byte characters
to see that they're handled appropriately.
Yes, it would be nice to add tests if they are already written.
Who wrote that patch though? Is it someone who has a copyright
assignment? I didn't reference it for my implementation, so no issues
now. Just not sure if we can import the tests if not.
Tim Waugh IIRC, who was working for Red Hat at the time,
and Red Hat have a blanket corporate copyright assignment.
Thank you!
Pádraig