Hello all,

I'd like to discuss a few aspects of multibyte processing in coreutils, as preparation for future improvements.

To start with an "easy" topic: how to handle invalid input (i.e. input octets that result in an invalid multibyte sequence).

A previous discussion concluded that there should be no internal conversion to wchar_t, so that invalid sequences can be handled as in the C locale ( https://lists.gnu.org/archive/html/coreutils/2010-09/msg00051.html ).
Pádraig's i18n plan left the handling issue open ("How do we handle invalid encodings; substitution, elision, leaving in place?", http://www.pixelbeat.org/docs/coreutils_i18n/).

Is there an agreement on how to handle those?

Do we want to fall back to the C locale, and does that imply going back, revisiting the invalid octets and re-processing them as single-byte characters? If so, the implementation needs to keep the N octets (up to MB_CUR_MAX) and be able to go back and process them.
Alternatively, we can treat only the last octet (the offending one that caused the sequence to be invalid) as a single-byte character, thus possibly losing data.

One possibility is to have all programs print an informative warning to stderr upon detecting the first invalid multibyte sequence, then resort to 'best-effort' (e.g. only the last octet, or something else that's easy to implement).

My rationale is that for an input file with invalid sequences, there is no one correct solution that would satisfy all cases: some users would think the obvious correct solution is to output invalid sequences as-is, others would think they should be silently ignored (i.e. a program should never generate invalid output even on invalid input).
The best we could do is warn them, and document a way to fix invalid files (along the lines of 'iconv --byte-subst="<0x%x>"').
Users could always fall back to forcing the C locale, and then all input bytes will be processed.

To be more concrete, here are some examples:

The Unicode code point U+2460 is 'CIRCLED DIGIT ONE', in UTF-8 octal:

    printf '\342\221\240'

I'll use the invalid sequence '\342\221\300' as input below.

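As a quick sanity check of the two sequences above, here is a minimal sketch that dumps the octets and asks iconv whether they form well-formed UTF-8. The helper names show_octets and is_valid_utf8 are made up for illustration, and the check assumes an iconv(1) that exits non-zero on an invalid input sequence (my understanding of glibc and GNU libiconv, but worth verifying on your system).

```shell
# Dump stdin as space-separated octal octets.
# (show_octets is a hypothetical helper name, not an existing tool.)
show_octets() {
  od -An -v -t o1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# Succeed only if stdin is well-formed UTF-8.
# Assumption: iconv exits non-zero when it meets an invalid sequence.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# U+2460 encoded correctly, then the invalid test input used in this mail.
printf '\342\221\240' | is_valid_utf8 && echo "U+2460: valid UTF-8"
printf '\342\221\300' | is_valid_utf8 || echo "test input: invalid UTF-8"

# Show the raw octets of the invalid input.
printf '\342\221\300' | show_octets
```

Piping a command's output through show_octets (for example the 'cut -c1' case below) makes it easy to see which octets a given tool actually emitted, independent of how the terminal renders them.
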
What should be the output in the following cases:

'cut': should it print '\300' or '\342'?

    printf '\342\221\300' | LC_ALL=en_US.UTF-8 cut -c1

'wc': should it print 1 (counting only '\300'), 3 (counting all octets), or 0? Currently it prints 0 because it doesn't count invalid multibyte characters.

    printf '\342\221\300' | LC_ALL=en_US.UTF-8 wc -m

There is a similar issue, but perhaps with different logic and rationale, with "wc -L".

'expand': should this be expanded to '\300' + 7 spaces + 'A', or '\342\221\300' + 5 spaces + 'A', or something else?

    printf '\342\221\300\tA\n' | LC_ALL=en_US.UTF-8 expand

'fold': should this print 'aa\342\n\221\300b\n' (treating them as single bytes), or 'aa\300\nb\n' (using only the last octet), or something else?

    printf 'aa\342\221\300b\n' | LC_ALL=en_US.UTF-8 fold -w 3

'printf' - deals only with bytes, e.g. the following should be printed as-is:

    env printf '%s\n' "$(env printf '\342\221\300')"
    env printf "$(env printf '\342\221\300')"

'fmt' and 'pr': I assume they should print the invalid sequence as-is, as they do not break mid-word.

'head', 'tail', 'split' - not relevant, as they deal with bytes, not characters.

'csplit': only indirectly relevant, as I seem to remember that a standard regex should never match an invalid multibyte sequence?

'shuf', 'paste' - not relevant, as they deal with complete lines.

'yes' - prints its input as-is, e.g. the following works:

    yes "$(env printf '\342\221\300')"

'test' - operators '-n' and '-z' work correctly with invalid sequences.

'expr': regex operations should never match (IIUC).

For 'substr', should this return '\300' or '\342'?

    LC_ALL=en_US.UTF-8 expr substr "$(printf '\342\221\300')" 1 1

For 'length', should this return 3 (treating the input as 3 single bytes) or 1 (counting the last, offending octet)?

    LC_ALL=en_US.UTF-8 expr length "$(printf '\342\221\300')"

For 'index', both STRING and CHAR might be invalid. Should an invalid CHAR parameter be rejected outright?

'numfmt' - as long as an invalid sequence doesn't get confused with a digit character, it should be printed as-is.

'seq' - doesn't take any input.

'date' - should print invalid characters in the format string as-is.

For now I'm going to side-step sort+join+uniq, as I think they present a more complicated set of issues when it comes to multibyte processing.

Comments are very welcome,
 - assaf
