Rich Felker wrote on 2006-02-11 19:43 UTC:
> > In particular:
> >
> > - 0xfffe could be misinterpreted by a later process as an anti-BOM, and
>
> This could be an issue on Windows/Java, but neither BOMs nor UTF-16
> are used on unix systems, for very good reason.
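[As an aside, the "anti-BOM" hazard can be made concrete with a small
sketch. This is a hypothetical illustration, not code from either
poster: a UTF-16 consumer that sniffs byte order from the first two
bytes cannot tell a byte-swapped BOM from a genuine leading U+FFFE.]

```c
#include <stddef.h>

/* Hypothetical sketch: byte-order detection as a UTF-16 reader might
 * do it.  A stream that legitimately begins with the noncharacter
 * U+FFFE (stored big-endian as FF FE) is indistinguishable from a
 * little-endian BOM -- the "anti-BOM" ambiguity discussed above. */
enum utf16_order { ORDER_BIG, ORDER_LITTLE, ORDER_UNKNOWN };

static enum utf16_order sniff_utf16_order(const unsigned char *p, size_t n)
{
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return ORDER_BIG;       /* U+FEFF stored big-endian */
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return ORDER_LITTLE;    /* U+FEFF little-endian -- or a
                                   big-endian U+FFFE noncharacter */
    return ORDER_UNKNOWN;
}
```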
But many "unix systems" sit in heterogeneous environments with UTF-16
protocols (NTFS, CIFS, etc.), talk to Windows/Java platforms, may even
run or be ported onto Windows, or host ported Windows applications.
The world is not always as simple as we would like it to be. :(

> > - 0xffff equals WEOF in most sizeof(wchar_t)=2 implementations.
>
> Any implementation with sizeof(wchar_t)==2 is not ISO C compliant,

Your interpretation of the holy book of ISO C! Systems with wchar_t =
uint16 exist and are widely deployed. People fight wars about holy
books.

If you worry about UTF-16 at all, then I think you should also worry
about these two. Otherwise, there is no point in worrying about
surrogates either.

> The standard says that wchar_t represents a whole
> character and that there are no such things as multi-wchar_t
> characters or shift states for wchar_t.

The wchar_t parts of ISO C that you refer to were written before 1995
(Amendment 1) by people primarily interested in EUC and other ISO
2022-like schemes, and have not been substantially revised since.
UTF-16 (published 1996) was not around when the text you now interpret
was written. You may be stretching the standard beyond its
interpretation capacity if you unify terms like "character" from
different committees and epochs in character-set history.

(Don't misunderstand me, I am not a fan of sizeof(wchar_t)==2; I
merely want to warn of the limits of interpreting things into
standards that the authors could not have been aware of, such as
Unicode's current character and encoding model.)

> I agree that noncharacter code points should not be written to text
> files for interchange with other applications; however, they are valid
> for internal use, and must remain valid for internal use invariant
> under conversions between different encoding forms. Forbidding them
> belongs at the level of the file writer in an application generating
> files for interchange, not at the mbrtowc/wcrtomb level.
> The latter
> could interfere with legitimate internal processing, and just slows
> UTF-8 processing down even more.

Pah, new-fangled, misguided Unicode view of the world! :) Holy books
are best when they are old ...

ISO 10646-1:1993/Am.2 (1996), section R.4, forbids both U+FFFF and
U+FFFE in UTF-8:

  "NOTE 3 - Values of x in the range 0000 D800 .. 0000 DFFF are
  reserved for the UTF-16 form and do not occur in UCS-4. The values
  0000 FFFE and 0000 FFFF also do not occur (see clause 8). The
  mappings of these code positions in UTF-8 are undefined."

http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html

> P.S. Did your old thread about converting invalid bytes to high
> surrogate codepoints (for binary-clean in-band error reporting) when
> decoding UTF-8 ever reach a conclusion? I found part of the thread
> while doing a search, but didn't see where it ended up.

I spent some time investigating schemes that create an isomorphic
mapping between malformed UTF-8 and malformed UTF-16. They all got
horribly complicated and unpleasant. I don't think that there is a
neat and efficient isomorphic mapping. The simple approach is to
define two separate surjective encodings, one to represent malformed
UTF-8 in UTF-16, and the other to represent malformed UTF-16 in UTF-8,
without asking for one to be the inverse of the other.

Markus

-- 
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
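[Editorial note: the clause quoted above names only U+FFFE and U+FFFF;
Unicode later generalized this to 66 noncharacters. A minimal sketch
of the check an interchange-file writer could apply -- the function
name is illustrative, not an existing API:]

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch: test for a Unicode noncharacter code point.  These are
 * U+FDD0..U+FDEF plus the last two code points of each of the 17
 * planes (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF).
 * Per the argument above, a check like this belongs in the file
 * writer generating interchange data, not in mbrtowc/wcrtomb. */
static bool is_noncharacter(uint32_t cp)
{
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return true;
    /* last two code points of any plane, within the Unicode range */
    return cp <= 0x10FFFF && (cp & 0xFFFF) >= 0xFFFE;
}
```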
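[Editorial note: one of the two surjective directions mentioned in the
last paragraph -- representing malformed UTF-8 in UTF-16 -- can be
sketched as below. The names and the particular escape range
U+DC80..U+DCFF are one possible convention chosen for illustration,
not something fixed by the thread:]

```c
#include <stdint.h>

/* Sketch: a UTF-8 decoder that hits an invalid byte b (0x80..0xFF)
 * emits the surrogate value 0xDC00 + b instead of failing; the
 * re-encoder inverts the mapping and reproduces the raw byte, making
 * error reporting binary-clean and in-band.  The opposite direction
 * (malformed UTF-16 in UTF-8) would be a separate encoding, not the
 * inverse of this one. */
static uint32_t escape_invalid_byte(uint8_t b)
{
    return 0xDC00u + b;                 /* maps into U+DC80..U+DCFF */
}

/* Returns 1 and stores the original byte if cp is an escape value,
 * 0 otherwise. */
static int unescape_byte(uint32_t cp, uint8_t *out)
{
    if (cp >= 0xDC80 && cp <= 0xDCFF) {
        *out = (uint8_t)(cp - 0xDC00u);
        return 1;
    }
    return 0;
}
```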