Re: fold: add the --characters option

Collin Funk Tue, 26 Aug 2025 00:25:51 -0700

Collin Funk <collin.fu...@gmail.com> writes:

>>> Good point.
>>> 
>>> Using fread is also faster since it doesn't check for the delimiter upon
>>> reading a character, and doesn't need to lock the stream each call
>>> (since we use fread_unlocked).
>>> 
>>> Will push the attached in a bit.
>>
>> Cool, that was quick!
>> You might include something like the following test.
>
> Actually, I realized that it is pointless to malloc line_out now. Since
> the input buffer is a fixed size and we do not do case conversion or
> anything that may change the number of bytes needed to represent the
> output.
>
> Will push this v2 patch in a bit instead.


Actually, the change from getline to fread made the implementation
buggy. It wasn't immediately apparent to me when I wrote the commit.

Using getline ensured that we wouldn't have incomplete multibyte
sequences at the end of the buffer. Since, at least for Unicode, the
newline character cannot be in a multibyte sequence. The use of fread
makes that no longer true.

Here is the test case that I used to make sure that I was not imagining
this issue:

    $ python3 -c 'print("a" * (262144 - 1) + "뉐" + "a" * 160)' > test.txt
    $ ./src/fold-old --characters test.txt
    [...]
    
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa뉐aaaaaaaaaaaaaa
    
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    [...]
    
    $ ./src/fold-new --characters test.txt
    [...]
    
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa뉐aaaaaaaaaaaaaaaa
    
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    [...]

The first run is wrong since the line with 뉐 has 78 characters when it
should have 80 like the rest. The second version is fixed. Note that 뉐
has a width of two characters so the line looks like it is 81 characters
long (at least using my fonts).

This occurs because the first byte of 뉐, 0xEB, is the last character of
the input IO_BUFSIZE-sized buffer. The following 0x89, and 0x90 bytes
are not available until the next fread. As a result, mcel_scan will
return with the fields g.ch == 0 and g.err != 0. Since I did not check
g.err, the NUL was counted as a character.

This patch fixes the issue by checking g.err and moving possible
incomplete sequences to the front of the input buffer before reading
again. If the number of bytes available is larger than the maximum
number of bytes that can create a multi-byte sequence we know that we
have an encoding error. In that case, we print the bytes as we receive
them.

Will pushed the attached fix with a test case later today.

Collin

>From 30dce64595a20ffc56a2a8757527a9b1fde8cef7 Mon Sep 17 00:00:00 2001
Message-ID: <30dce64595a20ffc56a2a8757527a9b1fde8cef7.1756193079.git.collin.fu...@gmail.com>
From: Collin Funk <collin.fu...@gmail.com>
Date: Mon, 25 Aug 2025 23:15:21 -0700
Subject: [PATCH] fold: don't truncate multibyte characters at the end of the
 buffer

* src/fold.c (fold_file): Replace invalid characters with the original
byte read. Copy multibyte sequences that may not yet be read to the
start of the buffer before reading more bytes.
* tests/fold/fold-characters.sh: Add a test case.
---
 src/fold.c                    | 24 ++++++++++++++++++++++--
 tests/fold/fold-characters.sh | 19 +++++++++++++++++++
 2 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/src/fold.c b/src/fold.c
index 7bf30cd0b..208b004d6 100644
--- a/src/fold.c
+++ b/src/fold.c
@@ -139,6 +139,7 @@ fold_file (char const *filename, size_t width)
   idx_t offset_out = 0;		/* Index in 'line_out' for next char. */
   static char line_out[IO_BUFSIZE];
   static char line_in[IO_BUFSIZE];
+  static size_t offset_in = 0;
   static size_t length_in = 0;
   int saved_errno;
 
@@ -158,14 +159,30 @@ fold_file (char const *filename, size_t width)
 
   fadvise (istream, FADVISE_SEQUENTIAL);
 
-  while (0 < (length_in = fread (line_in, 1, sizeof line_in, istream)))
+  while (0 < (length_in = fread (line_in + offset_in, 1,
+                                 sizeof line_in - offset_in, istream)))
     {
       char *p = line_in;
-      char *lim = p + length_in;
+      char *lim = p + length_in + offset_in;
       mcel_t g;
       for (; p < lim; p += g.len)
         {
           g = mcel_scan (p, lim);
+          if (g.err)
+            {
+              /* Replace the character with the byte if it cannot be a
+                 truncated multibyte sequence.  */
+              if (!(lim - p <= MCEL_LEN_MAX))
+                g.ch = p[0];
+              else
+                {
+                  /* It may be a truncated multibyte sequence.  Move it to the
+                     front of the input buffer.  */
+                  memmove (line_in, p, lim - p);
+                  offset_in = lim - p;
+                  goto next_line;
+                }
+            }
           if (g.ch == '\n')
             {
               memcpy (line_out + offset_out, p, g.len);
@@ -241,6 +258,9 @@ fold_file (char const *filename, size_t width)
         }
       if (feof (istream))
         break;
+      /* We read a full buffer of complete characters.  */
+      offset_in = 0;
+    next_line:
     }
 
   saved_errno = errno;
diff --git a/tests/fold/fold-characters.sh b/tests/fold/fold-characters.sh
index 0b22aad6b..f6829384b 100755
--- a/tests/fold/fold-characters.sh
+++ b/tests/fold/fold-characters.sh
@@ -58,4 +58,23 @@ compare column-exp2 column-out2 || fail=1
 fold --characters -w 10 input2 > character-out2 || fail=1
 compare character-exp2 character-out2 || fail=1
 
+# Test a Unicode character on the edge of the input buffer.
+# Keep in sync with IO_BUFSIZE - 1.
+yes a | head -n 262143 | tr -d '\n' > input3 || framework_failure_
+env printf '\uB250' >> input3 || framework_failure_
+yes a | head -n 100 | tr -d '\n' >> input3 || framework_failure_
+env printf '\n' >> input3 || framework_failure_
+
+yes a | head -n 80 | tr -d '\n' > exp3 || framework_failure_
+env printf '\n' >> exp3 || framework_failure_
+yes a | head -n 63 | tr -d '\n' >> exp3 || framework_failure_
+env printf '\uB250' >> exp3 || framework_failure_
+yes a | head -n 16 | tr -d '\n' >> exp3 || framework_failure_
+env printf '\n' >> exp3 || framework_failure_
+yes a | head -n 80 | tr -d '\n' >> exp3 || framework_failure_
+env printf '\naaaa\n' >> exp3 || framework_failure_
+
+fold --characters input3 | tail -n 4 > out3 || fail=1
+compare exp3 out3 || fail=1
+
 Exit $fail
-- 
2.51.0

Re: fold: add the --characters option

Reply via email to