Re: multibyte processing - handling invalid sequences (long)

Pádraig Brady Wed, 20 Jul 2016 05:21:40 -0700

On 20/07/16 07:11, Assaf Gordon wrote:
> Hello all,
> 
> I'd like to discuss a few aspects of multibyte processing in coreutils, in 
> preparation for future improvements.
> 
> To start with an "easy" topic: how to handle invalid input (i.e. input octets 
> that result in an invalid multibyte sequence).
> Previous discussion said no internal conversion to wchar_t so that invalid 
> sequences can be handled as C locale ( 
> https://lists.gnu.org/archive/html/coreutils/2010-09/msg00051.html ).
> Pádraig's i18n plan left the handling issue open ("How do we handle invalid 
> encodings; substitution, elision, leaving in place?",  
> http://www.pixelbeat.org/docs/coreutils_i18n/).
> 
> Is there an agreement on how to handle those?
> 
> Do we want to fall back to the C locale, and does that imply going back and 
> revisiting invalid octets and re-processing them as single-byte characters?
> If so, the implementation needs to keep the N octets (up to MB_CUR_MAX), and 
> be able to go back and process them. Alternatively, we can treat only the 
> last octet (the offending one that caused the sequence to be invalid) as a 
> single-byte character, thus possibly losing data.
> 
> 
> One possibility is to have all programs print an informative warning to 
> stderr upon the detection of the first invalid multibyte sequence, then 
> resort to 'best-effort' (e.g. only the last octet, or something else that's 
> easy to implement).
> My rationale is that for an input file with invalid sequences, there is no one 
> correct solution that would satisfy all cases: some users would think the 
> obvious correct solution is to output invalid sequences as-is, others would 
> think they should be silently ignored (i.e. a program should never generate 
> invalid output even on invalid input).
> The best we could do is warn them, and document a way to fix invalid files 
> (along the lines of 'iconv --byte-subst="<0x%x>"'). Users could always 
> fall back to forcing the C locale, and then all input bytes will be processed.
> 
> 
> 
> To be more concrete, here are some examples:
> 
> The Unicode code point U+2460 is 'CIRCLED DIGIT ONE',
> in UTF-8 octal: printf '\342\221\240'
> I'll use the invalid sequence '\342\221\300' as input below.
> 
> What should be the output in the following cases:
> 
> 'cut': should it print '\300' or '\342' ?
> 
>     printf '\342\221\300' | LC_ALL=en_US.UTF-8 cut -c1
> 
> 
> 'wc': should it print 1 (counting only '\300') or 3 (counting all octets) or 
> 0 ?
> Currently it prints 0 because it doesn't count invalid multibyte characters.
> 
>     printf '\342\221\300' | LC_ALL=en_US.UTF-8 wc -m
> 
> A similar issue, but perhaps with different logic and rationale, arises with "wc -L".
> 
> 
> 'expand': should this be expanded to '\300' + 7 spaces + 'A',
> or '\342\221\300' + 5 spaces + 'A' ? or something else ?
> 
>     printf '\342\221\300\tA\n' | LC_ALL=en_US.UTF-8 expand
> 
> 
> 
> 'fold': should this print: 'aa\342\n\221\300b\n' (treating them as 
> single-bytes), or
> 'aa\300\nb\n' (using only the last octet), or something else?
> 
>     printf 'aa\342\221\300b\n' | LC_ALL=en_US.UTF-8 fold -w 3
> 
> 
> 'printf' - deals only with bytes. e.g. the following should be printed as-is:
> 
>     env printf '%s\n' "$(env printf '\342\221\300')"
>     env printf "$(env printf '\342\221\300')"
> 
> 
> 'fmt' and 'pr': I assume they should print the invalid sequence as is, as 
> they do not break mid-words.
> 
> 'head', 'tail', 'split' - not relevant as they deal with bytes, not 
> characters.
> 
> 'csplit': only indirectly relevant, as I seem to remember that a standard regex 
> should never match an invalid multibyte sequence?
> 
> 'shuf', 'paste' - not relevant as they deal with complete lines.
> 
> 'yes' - prints input as-is, e.g. the following works:
> 
>     yes "$(env printf '\342\221\300')"
> 
> 'test' - operators '-n' and '-z' work correctly with invalid sequences.
> 
> 'expr': regex operations should never match (IIUC).
> for 'substr', should this return '\300' or '\342' ?
> 
>     LC_ALL=en_US.UTF-8 expr substr "$(printf '\342\221\300')" 1 1
> 
> for 'length', should this return 3 (treating as 3 single-bytes) or 1 
> (counting the last offending octet)?
> 
>     LC_ALL=en_US.UTF-8 expr length "$(printf '\342\221\300')"
> 
> for 'index', both STRING and CHAR might be invalid. Should an invalid CHAR 
> parameter be rejected outright ?
> 
> 'numfmt' - as long as it doesn't get confused with a digit character, invalid 
> sequences should be printed 'as-is'.
> 
> 'seq' - doesn't take any input.
> 
> 'date' - should print invalid characters in format string as-is.
> 
> 
> For now I'm going to side-step sort+join+uniq, as I think they present a more 
> complicated set of issues when it comes to multibyte processing.
> 
> 
> Comments are very welcome,
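
To make the candidate answers for the 'wc -m' example quoted above concrete,
here is a minimal Python sketch (not coreutils code). The function names and
the policy labels 'skip', 'last' and 'bytes' are made up for illustration, and
the validity check is deliberately simplified: it only looks at lead and
continuation byte ranges and does not reject overlong or surrogate encodings.

```python
SAMPLE = b"\xe2\x91\xc0"  # '\342\221\300': U+2460 with its last octet corrupted


def expected_len(lead: int) -> int:
    """Sequence length implied by a UTF-8 lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # stray continuation byte, or a byte that is never valid (e.g. 0xC0)


def count_chars(data: bytes, policy: str) -> int:
    """Count characters, handling invalid input according to `policy`.

    'skip'  : invalid octets are not counted at all
    'last'  : pending octets are dropped, only the offending octet is counted
    'bytes' : every invalid octet counts as one single-byte character
    """
    if policy not in ("skip", "last", "bytes"):
        raise ValueError(f"unknown policy: {policy}")
    count = 0
    i = 0
    while i < len(data):
        n = expected_len(data[i])
        # Take the lead byte plus the continuation bytes (0x80-0xBF) that follow it.
        j = i + 1
        while n and j < i + n and j < len(data) and 0x80 <= data[j] <= 0xBF:
            j += 1
        if n and j - i == n:
            count += 1      # complete sequence: one character
            i = j
        elif policy == "skip":
            i = j           # drop data[i:j] silently
        elif policy == "bytes":
            count += j - i  # each octet of data[i:j] is its own character
            i = j
        elif n == 0:
            count += 1      # 'last': a stray octet is itself the offender
            i = j
        else:
            # 'last': data[i:j] is an unfinished sequence and data[j], if any,
            # is the octet that broke it; count only that one.
            if j < len(data):
                count += 1
            i = j + 1
    return count


for policy in ("skip", "last", "bytes"):
    print(policy, count_chars(SAMPLE, policy))
```

For the sequence '\342\221\300' this prints 0, 1 and 3 respectively, which
correspond to the three options raised in the question.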


It's worth considering having a separate (already existing?) util
to fix data before processing. That could have options to:
  drop invalid chars, replace them with a replacement char,
  apply the various normalization forms
  (http://unicode.org/reports/tr15/#Norm_Forms),
  convert enclosed forms like ㊷ to 42, etc.
I.e. we should avoid complicating each util where possible,
and at least avoid having options on each util that could be
hoisted to a more general util like the one above.
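
As a rough illustration of what such a standalone filter could do, here is a
minimal Python sketch. The function name clean_utf8 and its parameters are
hypothetical, not an existing tool; it relies only on the error handlers of
Python's standard UTF-8 codec and on unicodedata.normalize.

```python
import unicodedata
from typing import Optional


def clean_utf8(data: bytes, invalid: str = "replace",
               form: Optional[str] = None) -> bytes:
    """Sanitize UTF-8 input before handing it to other tools.

    invalid : 'drop' removes invalid octets, 'replace' substitutes U+FFFD
    form    : optional normalization form ('NFC', 'NFD', 'NFKC' or 'NFKD')
    """
    handlers = {"drop": "ignore", "replace": "replace"}
    if invalid not in handlers:
        raise ValueError(f"unknown mode: {invalid}")
    text = data.decode("utf-8", errors=handlers[invalid])
    if form is not None:
        text = unicodedata.normalize(form, text)
    return text.encode("utf-8")


print(clean_utf8(b"aa\xe2\x91\xc0b", invalid="drop"))      # b'aab'
print(clean_utf8("\u32b7".encode("utf-8"), form="NFKC"))   # b'42'
```

Compatibility normalization (NFKC or NFKD) is what maps enclosed forms such
as ㊷ to 42; NFC and NFD leave them unchanged.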

Silently dropping invalid characters probably isn't a great idea,
and warnings to stderr are a bit messy and could be seen to contradict
POSIX, which suggests exiting with failure if anything is output to stderr.
A compromise might be to just replace invalid chars with
the replacement character � (U+FFFD) and then include that in
normal character processing, to make issues in input apparent.
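
A minimal Python sketch of that compromise, assuming invalid input is decoded
with the standard 'replace' error handler so that U+FFFD takes part in
ordinary per-character processing. The helper names are hypothetical and only
mimic 'cut -c' and 'fold -w' on a single line (the fold here counts
characters, not display columns). How many replacement characters one invalid
run yields depends on the decoder.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_visible(data: bytes) -> str:
    """Decode UTF-8, turning invalid octets into U+FFFD instead of failing."""
    return data.decode("utf-8", errors="replace")


def cut_chars(data: bytes, first: int, last: int) -> bytes:
    """Roughly 'cut -c first-last' for one line (1-based, inclusive)."""
    text = decode_visible(data)
    return text[first - 1:last].encode("utf-8")


def fold_chars(data: bytes, width: int) -> list:
    """Roughly 'fold -w width' for one line, counting characters."""
    text = decode_visible(data)
    return [text[i:i + width] for i in range(0, len(text), width)]


# The invalid sequence shows up as a visible U+FFFD in the output.
print(cut_chars(b"\xe2\x91\xc0", 1, 1) == REPLACEMENT.encode("utf-8"))
print(fold_chars(b"aa\xe2\x91\xc0b", 3))
```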

cheers,
Pádraig
