On Sun, Feb 12, 2006 at 05:27:16PM +0000, Markus Kuhn wrote:
> Rich Felker wrote on 2006-02-11 19:43 UTC:
> > > In particular:
> > >
> > > - 0xfffe could be misinterpreted by a later process as an anti-BOM, and
> >
> > This could be an issue on Windows/Java, but neither BOMs nor UTF-16
> > are used on unix systems, for very good reason.
>
> But many "unix systems" sit in heterogeneous environments with UTF-16
> protocols (NTFS, CIFS, etc.), talk to Windows/Java platforms, may even
> run or be ported onto Windows, or host ported Windows applications. The
> world is not always as simple as we would like it to be. :(
Yes and no. This UTF-16 processing should be isolated to programs that
actually deal with the Windows data, such as Samba. The only other
program I can think of that should ever have to handle UTF-16 on a unix
system is a web browser, for decoding UTF-16 documents served by
severely misconfigured (huge waste of bandwidth) Windows webservers.
(Even if all the content is CJK, HTML is full of bloated ASCII tags
which will all double in size with UTF-16, negating any marginal size
savings.)

> > > - 0xffff equals WEOF in most sizeof(wchar_t)=2 implementations.
> >
> > Any implementation with sizeof(wchar_t)==2 is not ISO C compliant,
>
> Your interpretation of the holy book of ISO C! Systems with wchar_t =
> uint16 exist and are widely deployed. People fight wars about holy books.
>
> If you worry about UTF-16 at all, then I think you should also worry
> about these two. Otherwise, there is no point in worrying about
> surrogates either.

There is a big difference. If surrogates decoded from UTF-8 are
converted to UTF-16 wrongly, it creates two ways of representing the
same character (a major security issue; a sketch of the check I mean is
further down). Otherwise you just end up with invalid characters in the
output (which you should be checking for on write, anyway).

(There's actually the issue with BOM too, which could have security
implications... I still can't believe people actually do something so
stupid and blatantly incorrect as processing a BOM...)

> > The standard says that wchar_t represents a whole
> > character and that there are no such things as multi-wchar_t
> > characters or shift states for wchar_t.
>
> The wchar_t parts of ISO C that you refer to were written before 1995
> (Amendment 1) by people primarily interested in EUC and other ISO
> 2022-like schemes, and were not substantially revised since. UTF-16
> (published 1996) was not around when the text you now interpret was
> written. You may be stretching the standard beyond its interpretation
> capacity, if you unify terms like "character" from different committees
> and epochs in character-set history.

I agree it's dangerous to equate different definitions of character,
especially since ISO C uses "character" (without the word multibyte) to
mean "byte". However, the part about the lack of shift/decoding state is
fairly clear that multi-wchar_t character encodings are not supposed to
exist.

> (Don't misunderstand me, I am not a fan of sizeof(wchar_t)==2; I merely
> want to warn of the limits of interpreting things into standards that
> the authors could not have been aware of, such as Unicode's current
> character and encoding model.)

IIRC the same language appears in C99, although I've only read the
draft myself, not the final standard, and not in detail.

> > I agree that noncharacter code points should not be written to text
> > files for interchange with other applications; however, they are valid
> > for internal use, and must remain valid for internal use invariant
> > under conversions between different encoding forms. Forbidding them
> > belongs at the level of the file writer in an application generating
> > files for interchange, not at the mbrtowc/wcrtomb level. The latter
> > could interfere with legitimate internal processing, and just slows
> > UTF-8 processing down even more.
>
> Pah, new-fangled, misguided Unicode view of the world! :)
> Holy books are best when they are old ...

Indeed! However, holy books are more accessible when they're published
publicly on the web.
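For what it's worth, the check I have in mind for the surrogate problem
is tiny. The sketch below is mine, not something from the standards or
from any existing C library, and utf8_reject is a made-up name; it
assumes the caller has already assembled the decoded code point c and
knows how many bytes the sequence occupied:

#include <stdint.h>

/*
 * Hypothetical helper (not part of any real library): given a code
 * point c decoded from an nbytes-long UTF-8 sequence, return nonzero
 * if the sequence must be rejected.  Surrogates and overlong forms are
 * the cases where accepting the input would create a second spelling
 * of some character once the data is converted to UTF-16 or
 * re-encoded.  Noncharacters such as U+FFFF are deliberately *not*
 * rejected here; that belongs in the file writer, not the decoder.
 */
static int utf8_reject(uint32_t c, int nbytes)
{
    static const uint32_t minval[] = { 0, 0, 0x80, 0x800, 0x10000 };

    if (c >= 0xD800 && c <= 0xDFFF)  /* surrogate code points */
        return 1;
    if (c > 0x10FFFF)                /* beyond the UTF-16-reachable range */
        return 1;
    if (nbytes < 1 || nbytes > 4)    /* 5- and 6-byte forms: always reject */
        return 1;
    if (c < minval[nbytes])          /* overlong encoding */
        return 1;
    return 0;
}

A decoder that applies this after assembling each code point never
emits a surrogate or an overlong spelling, so converting its output to
UTF-16 cannot yield a second representation of any character.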
> ISO 10646-1:1993/Am.2 (1996), section R.4, forbids both U+FFFF and
> U+FFFE in UTF-8:
>
> "NOTE 3 - Values of x in the range 0000 D800 .. 0000 DFFF are reserved
> for the UTF-16 form and do not occur in UCS-4. The values 0000 FFFE and
> 0000 FFFF also do not occur (see clause 8). The mappings of these code
> positions in UTF-8 are undefined."
>
> http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html

OK, in my view ISO-10646 trumps Unicode, so I will accept your
conclusion on the matter. (Unicode is full of wordprocessor-oriented,
Windows-oriented, 16bit-oriented, etc. crap. Sometimes it's useful for
deriving things like character classes, etc., so I don't want to dismiss
it entirely, but ISO-10646 is much more vendor-neutral and lacks various
stupid semantic requirements that conflict with C and POSIX.)

BTW your link seems to be to an old version, since my understanding is
that ISO-10646 has since forbidden overlong character encodings (and
also code points above 10FFFF...?).

> > P.S. Did your old thread about converting invalid bytes to high
> > surrogate codepoints (for binary-clean in-band error reporting) when
> > decoding UTF-8 ever reach a conclusion? I found part of the thread
> > while doing a search, but didn't see where it ended up.
>
> I spent some time investigating schemes that create an isomorphic
> mapping between malformed UTF-8 and malformed UTF-16. They all got
> horribly complicated and unpleasant. I don't think that there is a
> neat and efficient isomorphic mapping. The simple approach is to define
> two separate surjective encodings, one to represent malformed UTF-8 in
> UTF-16, and the other to represent malformed UTF-16 in UTF-8, without
> asking for one to be the inverse of the other.

IMO there's no good solution. Any such conversion is subject to the flaw
that string concatenation and conversion between encodings do not
commute, which is a Very Bad Thing and could have security implications.
Unless you have a better solution, my view is that applications wishing
to be binary-clean should either keep the data as bytes internally
(processing it as UTF-8 in a 'JIT' manner for display, searching, etc.)
or use their own internal representation. Regardless, the C library
implementation should do nothing but signal an (OOB) error on invalid
sequences and leave additional handling to the application. If you have
a different view I'd be happy to hear it.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/