Re: fold: add the --characters option

Collin Funk Wed, 27 Aug 2025 18:44:26 -0700

Collin Funk <collin.fu...@gmail.com> writes:

>>> 000000
>>> LC_ALL=en_US.UTF-8 src/fold
>>> 000000
>>> LC_ALL=C /bin/fold
>>> 000000 c3                                               >.<
>>> LC_ALL=en_US.UTF-8 /bin/fold
>>> 000000 c3                                               >.<
>>
>> I suppose a concrete way to test that might be:
>>
>>   # https://datatracker.ietf.org/doc/rfc9839/
>>   bad_unicode() { printf '\xC3|\u0000|\u0089|\uDEAD|\uD9BF\uDFFF\n'; }
>>   test $({ bad_unicode | fold; bad_unicode; } | uniq | wc -l) = 1 || fail=1
>
> Thanks, I'll have a look at it later today.


Good catch, I see what the issue is.

When a byte could be the start of a truncated multi-byte sequence, I
kept it buffered and continued reading the file. I did not handle the
case where such an invalid multi-byte sequence is at the end of the
file. Therefore, \xC3 was buffered but never printed.
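
For anyone who wants to reproduce this by hand, here is a minimal sketch.
It assumes a POSIX shell with printf, od, and tr available. The helper
name passes_truncated_byte is hypothetical and not part of coreutils. It
feeds a lone 0xC3 byte with no trailing newline to a command and checks
that the same byte comes back out.

```sh
#!/bin/sh
# Sketch: check that a command copies a lone, truncated UTF-8 lead byte
# (0xC3) at the end of its input through to its output unchanged.
# passes_truncated_byte is a hypothetical helper, not part of coreutils.
passes_truncated_byte ()
{
  # '\303' is the octal spelling of 0xC3; POSIX printf accepts octal
  # escapes, whereas '\xC3' is not supported by every shell builtin.
  out=$(printf '\303' | "$@" | od -An -tx1 | tr -d ' \n')
  test "$out" = c3
}

# Try it against whichever fold is first in PATH.
if passes_truncated_byte fold; then
  echo "fold: byte preserved"
else
  echo "fold: byte lost"
fi
```

To check a freshly built binary instead, pass its path, for example
"passes_truncated_byte src/fold" from the top of a build tree.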

This patch fixes it. The test feels a bit small for its own file, but
maybe it should be done anyway so we can add more test cases later?
WDYT?

Collin

From 5aa2035cdfc65401450acc7fd8a3328e9210eabb Mon Sep 17 00:00:00 2001
Message-ID: <5aa2035cdfc65401450acc7fd8a3328e9210eabb.1756345017.git.collin.fu...@gmail.com>
From: Collin Funk <collin.fu...@gmail.com>
Date: Wed, 27 Aug 2025 18:33:37 -0700
Subject: [PATCH] fold: fix handling of invalid multi-byte characters

* src/fold.c (fold_file): Continue the loop when we have buffered bytes
but nothing left to read from the file.
(adjust_column): Don't assume that the character is printable.
* tests/fold/fold-characters.sh: Add a new test case.
(bad_unicode): New function.
---
 src/fold.c                    | 17 ++++++++++++-----
 tests/fold/fold-characters.sh |  7 +++++++
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/src/fold.c b/src/fold.c
index 343ee62c3..5f71d5c55 100644
--- a/src/fold.c
+++ b/src/fold.c
@@ -115,10 +115,16 @@ adjust_column (size_t column, mcel_t g)
         column = 0;
       else if (g.ch == '\t')
         column += TAB_WIDTH - column % TAB_WIDTH;
-      else /* if (c32isprint (g.ch)) */
+      else
         {
-          last_character_width = (counting_mode == COUNT_CHARACTERS
-                                  ? 1 : c32width (g.ch));
+          if (counting_mode == COUNT_CHARACTERS)
+            last_character_width = 1;
+          else
+            {
+              int width = c32width (g.ch);
+              /* Default to a width of 1 if there is an invalid character.  */
+              last_character_width = width < 0 ? 1 : width;
+            }
           column += last_character_width;
         }
     }
@@ -160,7 +166,8 @@ fold_file (char const *filename, size_t width)
   fadvise (istream, FADVISE_SEQUENTIAL);
 
   while (0 < (length_in = fread (line_in + offset_in, 1,
-                                 sizeof line_in - offset_in, istream)))
+                                 sizeof line_in - offset_in, istream))
+         || 0 < offset_in)
     {
       char *p = line_in;
       char *lim = p + length_in + offset_in;
@@ -172,7 +179,7 @@ fold_file (char const *filename, size_t width)
             {
               /* Replace the character with the byte if it cannot be a
                  truncated multibyte sequence.  */
-              if (!(lim - p <= MCEL_LEN_MAX))
+              if (!(lim - p <= MCEL_LEN_MAX) || length_in == 0)
                 g.ch = p[0];
               else
                 {
diff --git a/tests/fold/fold-characters.sh b/tests/fold/fold-characters.sh
index 7de718450..cd17aa176 100755
--- a/tests/fold/fold-characters.sh
+++ b/tests/fold/fold-characters.sh
@@ -80,6 +80,13 @@ env printf '\naaaa\n' >> exp3 || framework_failure_
 fold --characters input3 | tail -n 4 > out3 || fail=1
 compare exp3 out3 || fail=1
 
+# Sequence derived from <https://datatracker.ietf.org/doc/rfc9839>.
+bad_unicode ()
+{
+  printf '\xC3|\u0000|\u0089|\uDEAD|\uD9BF\uDFFF\n' || framework_failure_
+}
+test $({ bad_unicode | fold; bad_unicode; } | uniq | wc -l) = 1 || fail=1
+
 # Ensure bounded memory operation
 vm=$(get_min_ulimit_v_ fold /dev/null) && {
   yes | tr -d '\n' | (ulimit -v $(($vm+8000)) && fold 2>err) | head || fail=1
-- 
2.51.0
