On Mon, May 05, 2014 at 08:08:32PM +0100, Sam Liddicott wrote:
> One of the advantages of utf-8 encoding was that it was easy to re-sync
> after an invalid sequence.
>
> It's a bit of a waste to then not do that. Minus points for musl.

An application can resync, although the C multibyte interfaces are not
really designed to be used this way (and you have to be careful if the
locale's encoding might be state-dependent, e.g. some legacy CJK
encodings). However the implementation cannot silently resync behind
your back. Doing so introduces serious bugs, some of which may be
security-relevant, since you either silently miss seeing some bytes
from the input when processing input via conversion to wide characters,
or some invalid sequences appear to the application as valid. Either
possibility is dangerous. In particular, it's wrong for the regex "."
to match anything that's an illegal sequence, and wrong for "^.*$" to
match a line containing any illegal sequences (since the "." can't
match it).

> Can you not run sed with LANG=C or LANG=POSIX?

That's not what they're doing, but it's not a solution anyway. ISO C
leaves the character encoding of the C locale implementation-defined,
and the Rationale text from the 1995 amendments to C explicitly allows
for the possibility that the C locale's character encoding has
multibyte characters (e.g. is UTF-8). musl presently does not support
byte-based characters at all, only UTF-8. This conforms to the current
versions of ISO C and POSIX, but the Austin Group has adopted a
requirement that the C locale be "8 bit clean" as a future requirement,
which musl will probably support at some time in the future.

Rich
