Collin Funk <collin.fu...@gmail.com> writes: >>> Good point. >>> >>> Using fread is also faster since it doesn't check for the delimiter upon >>> reading a character, and doesn't need to lock the stream each call >>> (since we use fread_unlocked). >>> >>> Will push the attached in a bit. >> >> Cool, that was quick! >> You might include something like the following test. > > Actually, I realized that it is pointless to malloc line_out now. Since > the input buffer is a fixed size and we do not do case conversion or > anything that may change the number of bytes needed to represent the > output. > > Will push this v2 patch in a bit instead.
Actually, the change from getline to fread made the implementation buggy. It wasn't immediately apparent to me when I wrote the commit. Using getline ensured that we wouldn't have incomplete multibyte sequences at the end of the buffer. Since, at least for Unicode, the newline character cannot be in a multibyte sequence. The use of fread makes that no longer true. Here is the test case that I used to make sure that I was not imagining this issue: $ python3 -c 'print("a" * (262144 - 1) + "뉐" + "a" * 160)' > test.txt $ ./src/fold-old --characters test.txt [...] aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa뉐aaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa [...] $ ./src/fold-new --characters test.txt [...] aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa뉐aaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa [...] The first run is wrong since the line with 뉐 has 78 characters when it should have 80 like the rest. The second version is fixed. Note that 뉐 has a width of two characters so the line looks like it is 81 characters long (at least using my fonts). This occurs because the first byte of 뉐, 0xEB, is the last character of the input IO_BUFSIZE-sized buffer. The following 0x89, and 0x90 bytes are not available until the next fread. As a result, mcel_scan will return with the fields g.ch == 0 and g.err != 0. Since I did not check g.err, the NUL was counted as a character. This patch fixes the issue by checking g.err and moving possible incomplete sequences to the front of the input buffer before reading again. If the number of bytes available is larger than the maximum number of bytes that can create a multi-byte sequence we know that we have an encoding error. In that case, we print the bytes as we receive them. Will pushed the attached fix with a test case later today. Collin
>From 30dce64595a20ffc56a2a8757527a9b1fde8cef7 Mon Sep 17 00:00:00 2001 Message-ID: <30dce64595a20ffc56a2a8757527a9b1fde8cef7.1756193079.git.collin.fu...@gmail.com> From: Collin Funk <collin.fu...@gmail.com> Date: Mon, 25 Aug 2025 23:15:21 -0700 Subject: [PATCH] fold: don't truncate multibyte characters at the end of the buffer * src/fold.c (fold_file): Replace invalid characters with the original byte read. Copy multibyte sequences that may not yet be read to the start of the buffer before reading more bytes. * tests/fold/fold-characters.sh: Add a test case. --- src/fold.c | 24 ++++++++++++++++++++++-- tests/fold/fold-characters.sh | 19 +++++++++++++++++++ 2 files changed, 41 insertions(+), 2 deletions(-) diff --git a/src/fold.c b/src/fold.c index 7bf30cd0b..208b004d6 100644 --- a/src/fold.c +++ b/src/fold.c @@ -139,6 +139,7 @@ fold_file (char const *filename, size_t width) idx_t offset_out = 0; /* Index in 'line_out' for next char. */ static char line_out[IO_BUFSIZE]; static char line_in[IO_BUFSIZE]; + static size_t offset_in = 0; static size_t length_in = 0; int saved_errno; @@ -158,14 +159,30 @@ fold_file (char const *filename, size_t width) fadvise (istream, FADVISE_SEQUENTIAL); - while (0 < (length_in = fread (line_in, 1, sizeof line_in, istream))) + while (0 < (length_in = fread (line_in + offset_in, 1, + sizeof line_in - offset_in, istream))) { char *p = line_in; - char *lim = p + length_in; + char *lim = p + length_in + offset_in; mcel_t g; for (; p < lim; p += g.len) { g = mcel_scan (p, lim); + if (g.err) + { + /* Replace the character with the byte if it cannot be a + truncated multibyte sequence. */ + if (!(lim - p <= MCEL_LEN_MAX)) + g.ch = p[0]; + else + { + /* It may be a truncated multibyte sequence. Move it to the + front of the input buffer. */ + memmove (line_in, p, lim - p); + offset_in = lim - p; + goto next_line; + } + } if (g.ch == '\n') { memcpy (line_out + offset_out, p, g.len); @@ -241,6 +258,9 @@ fold_file (char const *filename, size_t width) } if (feof (istream)) break; + /* We read a full buffer of complete characters. */ + offset_in = 0; + next_line: } saved_errno = errno; diff --git a/tests/fold/fold-characters.sh b/tests/fold/fold-characters.sh index 0b22aad6b..f6829384b 100755 --- a/tests/fold/fold-characters.sh +++ b/tests/fold/fold-characters.sh @@ -58,4 +58,23 @@ compare column-exp2 column-out2 || fail=1 fold --characters -w 10 input2 > character-out2 || fail=1 compare character-exp2 character-out2 || fail=1 +# Test a Unicode character on the edge of the input buffer. +# Keep in sync with IO_BUFSIZE - 1. +yes a | head -n 262143 | tr -d '\n' > input3 || framework_failure_ +env printf '\uB250' >> input3 || framework_failure_ +yes a | head -n 100 | tr -d '\n' >> input3 || framework_failure_ +env printf '\n' >> input3 || framework_failure_ + +yes a | head -n 80 | tr -d '\n' > exp3 || framework_failure_ +env printf '\n' >> exp3 || framework_failure_ +yes a | head -n 63 | tr -d '\n' >> exp3 || framework_failure_ +env printf '\uB250' >> exp3 || framework_failure_ +yes a | head -n 16 | tr -d '\n' >> exp3 || framework_failure_ +env printf '\n' >> exp3 || framework_failure_ +yes a | head -n 80 | tr -d '\n' >> exp3 || framework_failure_ +env printf '\naaaa\n' >> exp3 || framework_failure_ + +fold --characters input3 | tail -n 4 > out3 || fail=1 +compare exp3 out3 || fail=1 + Exit $fail -- 2.51.0