Collin Funk <collin.fu...@gmail.com> writes:

>>> 000000
>>> LC_ALL=en_US.UTF-8 src/fold
>>> 000000
>>> LC_ALL=C /bin/fold
>>> 000000 c3 >.<
>>> LC_ALL=en_US.UTF-8 /bin/fold
>>> 000000 c3 >.<
>>
>> I suppose a concrete way to test that might be:
>>
>> # https://datatracker.ietf.org/doc/rfc9839/
>> bad_unicode() { printf '\xC3|\u0000|\u0089|\uDEAD|\uD9BF\uDFFF\n'; }
>> test $({ bad_unicode | fold; bad_unicode; } | uniq | wc -l) = 1 || fail=1
>
> Thanks, I'll have a look at it later today.

Good catch, I see what the issue is. I kept reading the file whenever we
had a byte that might be a truncated multi-byte sequence, but I did not
handle the case of that invalid sequence being at the end of the file.
Therefore, \xC3 was buffered but never printed.

This patch fixes it.

The test feels a bit small for its own file. But maybe it should be done
anyway, so that we can add more test cases? WDYT?

Collin

From 5aa2035cdfc65401450acc7fd8a3328e9210eabb Mon Sep 17 00:00:00 2001
Message-ID: <5aa2035cdfc65401450acc7fd8a3328e9210eabb.1756345017.git.collin.fu...@gmail.com>
From: Collin Funk <collin.fu...@gmail.com>
Date: Wed, 27 Aug 2025 18:33:37 -0700
Subject: [PATCH] fold: fix handling of invalid multi-byte characters

* src/fold.c (fold_file): Continue the loop when we have buffered bytes
but nothing left to read from the file.
(adjust_column): Don't assume that the character is printable.
* tests/fold/fold-characters.sh: Add a new test case.
(bad_unicode): New function.
---
 src/fold.c                    | 17 ++++++++++++-----
 tests/fold/fold-characters.sh |  7 +++++++
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/src/fold.c b/src/fold.c
index 343ee62c3..5f71d5c55 100644
--- a/src/fold.c
+++ b/src/fold.c
@@ -115,10 +115,16 @@ adjust_column (size_t column, mcel_t g)
         column = 0;
       else if (g.ch == '\t')
         column += TAB_WIDTH - column % TAB_WIDTH;
-      else /* if (c32isprint (g.ch)) */
+      else
         {
-          last_character_width = (counting_mode == COUNT_CHARACTERS
-                                  ? 1 : c32width (g.ch));
+          if (counting_mode == COUNT_CHARACTERS)
+            last_character_width = 1;
+          else
+            {
+              int width = c32width (g.ch);
+              /* Default to a width of 1 if there is an invalid character.  */
+              last_character_width = width < 0 ? 1 : width;
+            }
           column += last_character_width;
         }
     }
@@ -160,7 +166,8 @@ fold_file (char const *filename, size_t width)
   fadvise (istream, FADVISE_SEQUENTIAL);
 
   while (0 < (length_in = fread (line_in + offset_in, 1,
-                                 sizeof line_in - offset_in, istream)))
+                                 sizeof line_in - offset_in, istream))
+         || 0 < offset_in)
     {
       char *p = line_in;
       char *lim = p + length_in + offset_in;
@@ -172,7 +179,7 @@ fold_file (char const *filename, size_t width)
             {
               /* Replace the character with the byte if it cannot be a
                  truncated multibyte sequence.  */
-              if (!(lim - p <= MCEL_LEN_MAX))
+              if (!(lim - p <= MCEL_LEN_MAX) || length_in == 0)
                 g.ch = p[0];
               else
                 {
diff --git a/tests/fold/fold-characters.sh b/tests/fold/fold-characters.sh
index 7de718450..cd17aa176 100755
--- a/tests/fold/fold-characters.sh
+++ b/tests/fold/fold-characters.sh
@@ -80,6 +80,13 @@ env printf '\naaaa\n' >> exp3 || framework_failure_
 fold --characters input3 | tail -n 4 > out3 || fail=1
 compare exp3 out3 || fail=1
 
+# Sequence derived from <https://datatracker.ietf.org/doc/rfc9839>.
+bad_unicode ()
+{
+  printf '\xC3|\u0000|\u0089|\uDEAD|\uD9BF\uDFFF\n' || framework_failure_
+}
+test $({ bad_unicode | fold; bad_unicode; } | uniq | wc -l) = 1 || fail=1
+
 # Ensure bounded memory operation
 vm=$(get_min_ulimit_v_ fold /dev/null) && {
   yes | tr -d '\n' | (ulimit -v $(($vm+8000)) && fold 2>err) | head || fail=1
-- 
2.51.0
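
For anyone reading along, the new test relies on a small idiom: print the filtered output followed by the raw input, and let uniq collapse them into one line if they are identical. Below is a minimal standalone sketch of that idiom, not part of the patch. The names unchanged_by, lone_lead_byte, keep_all and drop_c3 are hypothetical helpers made up for illustration, and the two filters are stand-ins rather than fold itself. It assumes the producer emits exactly one line, and uses an octal escape so any POSIX printf accepts it.

```shell
#!/bin/sh
# Work on bytes so that tr and uniq do not interpret the invalid sequence.
LC_ALL=C
export LC_ALL

# Hypothetical helper: succeeds when FILTER passes the single line printed
# by PRODUCER through unchanged.  Both arguments are command names.
unchanged_by ()
{
  producer=$1
  filter=$2
  # Filtered output first, then the raw output.  If the two lines are
  # identical, uniq merges them and wc -l reports 1.
  test "$({ "$producer" | "$filter"; "$producer"; } | uniq | wc -l)" -eq 1
}

# A lone UTF-8 lead byte (0xC3) followed by a newline.
lone_lead_byte () { printf '\303\n'; }

# Stand-in filters, for illustration only.
keep_all () { cat; }
drop_c3 () { tr -d '\303'; }

unchanged_by lone_lead_byte keep_all && echo "keep_all: unchanged"
unchanged_by lone_lead_byte drop_c3 || echo "drop_c3: altered"
```

A filter that silently drops the byte yields two different lines, so the line count is 2 and the check fails. Replacing the stand-in filter with fold gives the shape of the check used in the test above.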