multibyte processing - handling invalid sequences (long)

Assaf Gordon Tue, 19 Jul 2016 23:12:08 -0700

Hello all,

I'd like to discuss a few aspects of multibyte processing in coreutils, as
preparation for future improvements.


To start with an "easy" topic: how to handle invalid input (i.e. input octets 
that result in invalid multibyte sequence).
A previous discussion said there should be no internal conversion to wchar_t,
so that invalid sequences can be handled as in the C locale (
https://lists.gnu.org/archive/html/coreutils/2010-09/msg00051.html ).
Pádraig's i18n plan left the handling issue open ("How do we handle invalid 
encodings; substitution, elision, leaving in place?",  
http://www.pixelbeat.org/docs/coreutils_i18n/).

Is there an agreement on how to handle those?

Do we want to fall back to the C locale, and does that imply going back,
revisiting the invalid octets and re-processing them as single-byte characters?
If so, the implementation needs to keep the N octets (up to MB_CUR_MAX) and be
able to go back and process them. Alternatively, we can treat only the last
octet (the offending one that caused the sequence to be invalid) as a
single-byte character, thus possibly losing data.
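
To make the two alternatives concrete, here is a rough shell sketch. The
helper name 'fallback' and its token output format are made up for
illustration. It uses a hand-rolled UTF-8 check rather than mbrtowc(), does not
reject overlong or surrogate encodings, and always emits the offending octet as
a single byte even if it could start a new sequence, so it is only an
approximation of what a real implementation would do.

```shell
# fallback MODE  (hypothetical helper; MODE is "all" or "last")
# Reads octets on stdin and prints one token per line:
#   [o1 o2 ...]  a complete multibyte character (octets in octal)
#   <o>          an octet treated as a single-byte character
fallback() {
    mode=$1
    need=0        # continuation octets still expected
    pending=      # octets of the sequence being assembled
    for d in $(od -An -v -tu1); do
        o=$(printf '%03o' "$d")
        if [ "$need" -eq 0 ]; then
            if [ "$d" -lt 128 ]; then
                printf '[%s]\n' "$o"              # ASCII
            elif [ "$d" -ge 194 ] && [ "$d" -le 223 ]; then
                need=1; pending=$o                # lead octet, 2-octet form
            elif [ "$d" -ge 224 ] && [ "$d" -le 239 ]; then
                need=2; pending=$o                # lead octet, 3-octet form
            elif [ "$d" -ge 240 ] && [ "$d" -le 244 ]; then
                need=3; pending=$o                # lead octet, 4-octet form
            else
                printf '<%s>\n' "$o"              # cannot start a sequence
            fi
        elif [ "$d" -ge 128 ] && [ "$d" -le 191 ]; then
            pending="$pending $o"                 # valid continuation octet
            need=$((need - 1))
            if [ "$need" -eq 0 ]; then
                printf '[%s]\n' "$pending"
                pending=
            fi
        else
            # $o is the offending octet that makes the sequence invalid.
            if [ "$mode" = all ]; then
                # Go back and emit the kept octets as single bytes.
                for p in $pending; do printf '<%s>\n' "$p"; done
            fi
            printf '<%s>\n' "$o"
            need=0; pending=
        fi
    done
    # An incomplete sequence at end of input gets the same policy.
    if [ "$mode" = all ]; then
        for p in $pending; do printf '<%s>\n' "$p"; done
    fi
}
```

With the octets '\342\221\300' on stdin, 'fallback all' prints the three tokens
<342>, <221> and <300>, while 'fallback last' prints only <300>: the first two
octets are silently dropped, which is the data loss mentioned above.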


One possibility is to have all programs print an informative warning to stderr 
upon the detection of the first invalid multibyte sequence, then resort to 
'best-effort' (e.g. only the last octet, or something else that's easy to 
implement).
My rationale is that for an input file with invalid sequences, there is no one
correct solution that would satisfy all cases: some users would think the 
obvious correct solution is to output invalid sequences as-is, others would 
think they should be silently ignored (i.e. a program should never generate 
invalid output even on invalid input).
The best we could do is warn them, and document a way to fix invalid files 
(along the lines of 'iconv --byte-subst="<0x%x>"'). Users could always fall back
to forcing the C locale, and then all input bytes will be processed.
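
As a rough user-level approximation of "warn, then fall back to the C locale",
something like the wrapper below could be used today. The name 'run_checked'
and the warning text are hypothetical. It assumes that
'iconv -f UTF-8 -t UTF-8' exits non-zero on invalid input, that the input is
meant to be UTF-8, and that the input is a regular file (it is read twice).

```shell
# run_checked FILE CMD [ARGS...]  (hypothetical wrapper)
# Run CMD with FILE on stdin. If FILE is not valid UTF-8, warn on
# stderr and run CMD in the C locale so every octet is processed.
run_checked() {
    file=$1
    shift
    if iconv -f UTF-8 -t UTF-8 < "$file" > /dev/null 2>&1; then
        "$@" < "$file"
    else
        printf 'warning: %s: invalid multibyte sequence, using C locale\n' \
            "$file" >&2
        LC_ALL=C "$@" < "$file"
    fi
}
```

For example, 'run_checked input.txt wc -m' would count every octet of a file
containing invalid sequences instead of skipping them.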



To be more concrete, here are some examples:

The Unicode code point U+2460 is 'CIRCLED DIGIT ONE',
in UTF-8 octal: printf '\342\221\240'
I'll use the invalid sequence '\342\221\300' as input below.
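
The reason the second sequence is invalid: in UTF-8 every octet after the lead
octet must have the bit pattern 10xxxxxx, and '\300' (binary 11000000) does not.
A small sketch to check this from the shell (the helper name
'is_continuation' is made up for illustration):

```shell
# Dump both sequences as octal octets.
printf '\342\221\240' | od -An -to1    # octets: 342 221 240 (valid)
printf '\342\221\300' | od -An -to1    # octets: 342 221 300 (invalid)

# is_continuation OCTAL  (hypothetical helper)
# Succeeds if the octet, given as octal digits, matches 10xxxxxx,
# i.e. (octet & 0300) == 0200.
is_continuation() {
    [ $(( 0$1 & 0300 )) -eq $(( 0200 )) ]
}

for o in 221 240 300; do
    if is_continuation "$o"; then
        echo "\\$o: continuation octet"
    else
        echo "\\$o: not a continuation octet"
    fi
done
```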

What should be the output in the following cases:

'cut': should it print '\300' or '\342' ?

    printf '\342\221\300' | LC_ALL=en_US.UTF-8 cut -c1


'wc': should it print 1 (counting only '\300'), 3 (counting all octets) or 0?
Currently it prints 0 because it doesn't count invalid multibyte characters.

    printf '\342\221\300' | LC_ALL=en_US.UTF-8 wc -m

A similar issue, but perhaps with different logic and rationale, arises with
"wc -L".


'expand': should this be expanded to '\300' + 7 spaces + 'A',
or '\342\221\300' + 5 spaces + 'A', or something else?

    printf '\342\221\300\tA\n' | LC_ALL=en_US.UTF-8 expand



'fold': should this print 'aa\342\n\221\300b\n' (treating them as single
bytes), 'aa\300\nb\n' (using only the last octet), or something else?

    printf 'aa\342\221\300b\n' | LC_ALL=en_US.UTF-8 fold -w 3


'printf' - deals only with bytes. e.g. the following should be printed as-is:

    env printf '%s\n' "$(env printf '\342\221\300')"
    env printf "$(env printf '\342\221\300')"


'fmt' and 'pr': I assume they should print the invalid sequence as is, as they 
do not break mid-words.

'head', 'tail', 'split' - not relevant as they deal with bytes, not characters.

'csplit': only indirectly relevant, as I seem to remember that a standard regex
should never match an invalid multibyte sequence?

'shuf', 'paste' - not relevant as they deal with complete lines.

'yes' - prints input as-is, e.g. the following works:

    yes "$(env printf '\342\221\300')"

'test' - operators '-n' and '-z' work correctly with invalid sequences.

'expr': regex operations should never match (IIUC).
for 'substr', should this return '\300' or '\342' ?

    LC_ALL=en_US.UTF-8 expr substr "$(printf '\342\221\300')" 1 1

for 'length', should this return 3 (treating as 3 single-bytes) or 1 (counting 
the last offending octet)?

    LC_ALL=en_US.UTF-8 expr length "$(printf '\342\221\300')"

for 'index', both STRING and CHAR might be invalid. Should an invalid CHAR 
parameter be rejected outright ?

'numfmt' - as long as it doesn't get confused with a digit character, invalid 
sequences should be printed 'as-is'.

'seq' - doesn't take any input.

'date' - should print invalid characters in format string as-is.


For now I'm going to side-step sort+join+uniq, as I think they present a more 
complicated set of issues when it comes to multibyte processing.


Comments are very welcome,
 - assaf
