On 15/10/2025 13:34, Michael Cornelison wrote:
The Linux shell command: $ cut -c6- de.text > de2.text
outputs 2114 correct lines with first 5 characters removed.
From line 2115, the two characters (hex 80, hex AF) are prepended to every
output line.
The rest of each output line is correct.
I have attached the file "de.text" which triggers this bug.
I am using Ubuntu 25.04 in case that matters.
regards
Mike Cornelison
The issue is that cut(1) does not support multi-byte characters yet,
and is treating -c like -b. This can cause cut(1) to
output a partial multi-byte character. In your case,
the dump below shows that byte 6, where -c6- starts, falls in the middle of
the 3-byte UTF-8 encoding (e2 80 af) of U+202F NARROW NO-BREAK SPACE:
LC_ALL=de_DE.UTF-8 git/coreutils/src/cut -c1-10 de.text |
head -n2115 | tail -n1 | od -Ax -tx1z -v
000000 33 31 30 30 e2 80 af c3 9c 62 0a >3100.....b.<
This is already on our TODO list.
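To see the effect without the attachment, here is a minimal sketch that rebuilds only the ten bytes shown in the od dump above. The helper name sample_line is made up for illustration, and -b is used on purpose so the result is byte-wise regardless of locale or cut version:

```shell
# Hypothetical helper: emits the first 10 bytes of line 2115 as seen in
# the od dump: "3100", U+202F (e2 80 af), "Ü" (c3 9c), "b", then newline.
sample_line() { printf '3100\342\200\257\303\234b\n'; }

# Byte-wise selection from byte 6 onward. Byte 5 is e2, the lead byte of
# U+202F, so the output begins with its two trailing bytes 80 af.
sample_line | cut -b6- | od -An -tx1
# bytes printed: 80 af c3 9c 62 0a
```

The leading 80 af matches the two stray bytes in the report. A cut that handled multi-byte characters for -c would be expected to start at c3 9c instead.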
thank you,
Padraig