On Sun, Jan 09, 2011 at 10:21:50PM +0000, Thorsten Glaser wrote:
> Roger Leigh dixit:
> 
> >From my reading of the standards a UTF-8 C locale would be required
> >to behave identically to the existing ASCII C locale:
> >
> >• will consider all byte sequences valid
> 
> I think it wouldn’t, since UTF-8 mbrtowc/wcrtomb don’t work
> this way, and it can’t be done with “just” the POSIX API
> anyway because they aren’t allowed to not read any input
> byte when outputting (in MirBSD, I’ve added a sister
> function to mbrtowc which can do that), so not everything
> can be accepted in all situations.

If you are using the multibyte functions, then I agree that these are
special cases: they require valid input in order to function correctly,
so they would of course fail on invalid sequences when run in a UTF-8 C
locale.  However, they should fail in an ASCII C locale as well (I
should test this), given that the wide character representation is
always UCS-4 on GNU/Linux and a latin1 sequence, for example, would not
be valid UTF-8.
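
A minimal sketch of how that could be tested, assuming nothing beyond
the standard mbrtowc() interface.  The helper name count_decodable is
hypothetical, and the result depends entirely on the LC_CTYPE locale in
effect when it is called; it is not a statement about what any
particular libc does for high bytes in the plain C locale.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the wide characters that mbrtowc() can decode
 * from buf under the current LC_CTYPE locale.  Returns -1 as soon as the
 * input is rejected as invalid or ends in the middle of a character. */
static long count_decodable(const char *buf, size_t len)
{
    mbstate_t st;
    long count = 0;

    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    while (len > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, buf, len, &st);

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;                  /* invalid or truncated sequence */
        if (n == 0)
            n = 1;                      /* a decoded NUL consumes one byte */
        buf += n;
        len -= n;
        count++;
    }
    return count;
}
```

Calling count_decodable("caf\xE9", 4) after setlocale(LC_CTYPE, ...)
with a UTF-8 locale and again with the plain C locale would show
whether the latin1 byte 0xE9 is rejected in each case.  Pure ASCII
input should decode in either.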

I think "all byte sequences valid" applies mainly to narrow character
I/O, i.e. printf/puts etc. won't alter, drop or otherwise mangle any
non-7-bit-ASCII codes.  I think the intent was to ensure 8-bit
cleanliness in a 7-bit locale, which naturally extends to UTF-8.  I'm
not sure that wide character support is implied here, given that it
implicitly requires correct byte sequences to function, whereas narrow
character I/O does not (all 8-bit codes are correct).
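
To illustrate what I mean by 8-bit cleanliness, here is a rough sketch
that pushes a byte string through the narrow "%s" conversion and checks
that nothing was altered or dropped.  The helper name
narrow_roundtrip_ok and the 64-byte buffer are assumptions made for the
example only.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical check: format a byte string through the narrow "%s"
 * conversion and report whether every byte came out unchanged. */
static int narrow_roundtrip_ok(const char *bytes)
{
    char out[64];
    int n = snprintf(out, sizeof out, "%s", bytes);

    if (n < 0 || (size_t)n >= sizeof out)
        return 0;                       /* error or output truncated */
    return (size_t)n == strlen(bytes) && memcmp(out, bytes, (size_t)n) == 0;
}
```

The narrow path only copies bytes, so a latin1 string such as
"caf\xE9", or a sequence that is not valid UTF-8 such as "\xC3\x28",
would be expected to pass through intact, whereas a wide character
conversion of the same bytes has to decode them first.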


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
