On 20/07/16 07:11, Assaf Gordon wrote:
> Hello all,
>
> I'd like to discuss a few aspects of multibyte processing in coreutils,
> as preparation for future improvements.
>
> To start with an "easy" topic: how to handle invalid input (i.e. input
> octets that result in an invalid multibyte sequence).
> Previous discussion said no internal conversion to wchar_t so that
> invalid sequences can be handled as in the C locale
> ( https://lists.gnu.org/archive/html/coreutils/2010-09/msg00051.html ).
> Pádraig's i18n plan left the handling issue open ("How do we handle
> invalid encodings; substitution, elision, leaving in place?",
> http://www.pixelbeat.org/docs/coreutils_i18n/).
>
> Is there an agreement on how to handle those?
>
> Do we want to fall back to the C locale, and does that imply going back
> and revisiting the invalid octets and re-processing them as single-byte
> characters?
> If so, the implementation needs to keep the N octets (up to MB_CUR_MAX)
> and be able to go back and process them. Alternatively, we can treat
> only the last octet (the offending one that caused the sequence to be
> invalid) as a single-byte character, thus possibly losing data.
>
> One possibility is to have all programs print an informative warning to
> stderr upon detecting the first invalid multibyte sequence, then
> resort to 'best-effort' (e.g. only the last octet, or something else
> that's easy to implement).
> My rationale is that for an input file with invalid sequences, there is
> no one correct solution that would satisfy all cases: some users would
> think the obvious correct solution is to output invalid sequences as-is,
> others would think they should be silently ignored (i.e. a program
> should never generate invalid output even on invalid input).
> The best we could do is warn them, and document a way to fix invalid
> files (along the lines of 'iconv --byte-subst="<0x%x>"'). Users could
> always fall back to forcing the C locale, and then all input bytes will
> be processed.
>
> To be more concrete, here are some examples:
>
> The Unicode code point U+2460 is 'CIRCLED DIGIT ONE',
> in UTF-8 octal: printf '\342\221\240'
> I'll use the invalid sequence '\342\221\300' as input below.
>
> What should be the output in the following cases:
>
> 'cut': should it print '\300' or '\342' ?
>
>     printf '\342\221\300' | LC_ALL=en_US.UTF-8 cut -c1
>
> 'wc': should it print 1 (counting only '\300') or 3 (counting all
> octets) or 0 ?
> Currently it prints 0 because it doesn't count invalid multibyte
> characters.
>
>     printf '\342\221\300' | LC_ALL=en_US.UTF-8 wc -m
>
> A similar issue, but perhaps with different logic and rationale, arises
> with "wc -L".
>
> 'expand': should this be expanded to '\300' + 7 spaces + 'A',
> or '\342\221\300' + 5 spaces + 'A' ? Or something else ?
>
>     printf '\342\221\300\tA\n' | LC_ALL=en_US.UTF-8 expand
>
> 'fold': should this print 'aa\342\n\221\300b\n' (treating them as
> single bytes), or 'aa\300\nb\n' (using only the last octet), or
> something else?
>
>     printf 'aa\342\221\300b\n' | LC_ALL=en_US.UTF-8 fold -w 3
>
> 'printf' - deals only with bytes, e.g. the following should be printed
> as-is:
>
>     env printf '%s\n' "$(env printf '\342\221\300')"
>     env printf "$(env printf '\342\221\300')"
>
> 'fmt' and 'pr': I assume they should print the invalid sequence as-is,
> as they do not break mid-word.
>
> 'head', 'tail', 'split' - not relevant as they deal with bytes, not
> characters.
>
> 'csplit': only indirectly relevant, as I seem to remember that standard
> regex should never match an invalid multibyte sequence?
>
> 'shuf', 'paste' - not relevant as they deal with complete lines.
>
> 'yes' - prints input as-is, e.g. the following works:
>
>     yes "$(env printf '\342\221\300')"
>
> 'test' - operators '-n' and '-z' work correctly with invalid sequences.
>
> 'expr': regex operations should never match (IIUC).
> For 'substr', should this return '\300' or '\342' ?
>
>     LC_ALL=en_US.UTF-8 expr substr "$(printf '\342\221\300')" 1 1
>
> For 'length', should this return 3 (treating it as 3 single bytes) or 1
> (counting the last offending octet)?
>
>     LC_ALL=en_US.UTF-8 expr length "$(printf '\342\221\300')"
>
> For 'index', both STRING and CHAR might be invalid. Should an invalid
> CHAR parameter be rejected outright ?
>
> 'numfmt' - as long as it doesn't get confused with a digit character,
> invalid sequences should be printed as-is.
>
> 'seq' - doesn't take any input.
>
> 'date' - should print invalid characters in the format string as-is.
>
> For now I'm going to side-step sort+join+uniq, as I think they present
> a more complicated set of issues when it comes to multibyte processing.
>
> Comments are very welcome,
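
To make the two fall-back strategies from the quoted message concrete, here is a rough sketch in Python rather than C, so it does not depend on an installed locale. It is not how any coreutils program behaves today. The names `seq_len` and `split_chars` and the policy labels "reprocess" and "last-octet" are made up for illustration, and the UTF-8 check is deliberately simplified (it does not reject overlong forms or surrogates).

```python
def seq_len(lead):
    """Length implied by a UTF-8 lead byte, or 0 if it cannot start a sequence."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def split_chars(data, policy):
    """Split bytes into 'characters' under one of two hypothetical policies.

    "reprocess":  rewind on an invalid sequence and emit its first octet as a
                  single-byte character; the remaining octets are scanned again.
    "last-octet": drop the octets read so far and keep only the offending
                  octet as a single-byte character (data may be lost).
    Simplified: overlong forms and surrogates are not rejected.
    """
    chars = []
    i = 0
    while i < len(data):
        n = seq_len(data[i])
        j = i + 1
        # Consume continuation bytes (10xxxxxx) while the sequence is incomplete.
        while j < i + n and j < len(data) and 0x80 <= data[j] <= 0xBF:
            j += 1
        if n and j == i + n:
            chars.append(data[i:j])          # complete multibyte character
            i = j
        elif policy == "reprocess":
            chars.append(data[i:i + 1])      # first octet as a single byte
            i += 1
        else:                                # "last-octet"
            bad = i if n == 0 else j         # octet that broke the sequence
            if bad < len(data):
                chars.append(data[bad:bad + 1])
            i = bad + 1
    return chars


data = b"\342\221\300"                       # same octets as the printf examples
print(len(split_chars(data, "reprocess")), split_chars(data, "reprocess")[:1])
print(len(split_chars(data, "last-octet")), split_chars(data, "last-octet")[:1])
```

Under "reprocess" the example input yields three characters, the first being '\342'. Under "last-octet" it yields one, '\300'. Those are the two candidate answers asked about for 'wc -m', 'cut -c1' and 'expr length' above.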

It's worth considering having a separate (already existing?) util to fix
data before processing. That could have options to: drop invalid chars,
replace them with the replacement char, apply the various normalization
forms (http://unicode.org/reports/tr15/#Norm_Forms), convert enclosed
forms like ㊷ to 42, etc.

I.e. we should avoid complicating each util where possible, and at least
avoid having options on each util that could be hoisted to a more
general util like the above.

Silently dropping invalid characters probably isn't a great idea, and
warnings to stderr are a bit messy and could be seen to contradict
POSIX, which suggests exiting with failure if anything is output to
stderr.

A compromise might be to just replace invalid chars with the replacement
character � and then include that in normal character processing, to
make issues in input apparent.

cheers,
Pádraig
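
As a rough illustration of what such a separate "fix the data first" filter could do, here is a minimal Python sketch. The function name `sanitize` and its `invalid` and `normalize` options are hypothetical and do not describe an existing utility. Python's built-in "replace" and "ignore" decoder error handlers stand in for the substitution and elision options, and `unicodedata.normalize` for the normalization forms. How many U+FFFD characters a given invalid run produces is decided by Python's decoder, so the sketch makes no claim about that.

```python
import unicodedata

REPLACEMENT = "\N{REPLACEMENT CHARACTER}"    # U+FFFD


def sanitize(raw, invalid="replace", normalize=None):
    """Hypothetical pre-filter: decode UTF-8, dealing with invalid octets up front.

    invalid:   "replace" substitutes U+FFFD, "drop" elides the bad octets.
    normalize: None, or a normalization form name such as "NFC" or "NFKC".
    """
    errors = {"replace": "replace", "drop": "ignore"}[invalid]
    text = raw.decode("utf-8", errors=errors)
    if normalize:
        text = unicodedata.normalize(normalize, text)
    return text


# Invalid input stays visible as U+FFFD instead of vanishing silently.
print(sanitize(b"aa\342\221\300b\n"), end="")

# Compatibility normalization turns enclosed forms into plain digits.
print(sanitize("\u2460 \u32b7".encode("utf-8"), normalize="NFKC"))
```

The output of "replace" is valid UTF-8 again, so downstream utils would only ever see well-formed characters and could count U+FFFD like any other character.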
