Rich Felker wrote on 2006-02-11 19:43 UTC:
> > In particular:
> >
> > - 0xfffe could be misinterpreted by a later process as an anti-BOM, and
>
> This could be an issue on Windows/Java, but neither BOMs nor UTF-16
> are used on unix systems, for very good reason.
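[As an aside, the "anti-BOM" hazard can be made concrete with a small
sketch. This is a hypothetical illustration, not code from either
poster: a UTF-16 consumer that sniffs byte order from the first two
bytes cannot tell a byte-swapped BOM from a genuine leading U+FFFE.]

```c
#include <stddef.h>

/* Hypothetical sketch: byte-order detection as a UTF-16 reader might
 * do it.  A stream that legitimately begins with the noncharacter
 * U+FFFE (stored big-endian as FF FE) is indistinguishable from a
 * little-endian BOM -- the "anti-BOM" ambiguity discussed above. */
enum utf16_order { ORDER_BIG, ORDER_LITTLE, ORDER_UNKNOWN };

static enum utf16_order sniff_utf16_order(const unsigned char *p, size_t n)
{
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return ORDER_BIG;       /* U+FEFF stored big-endian */
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return ORDER_LITTLE;    /* U+FEFF little-endian -- or a
                                   big-endian U+FFFE noncharacter */
    return ORDER_UNKNOWN;
}
```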
But many "unix systems" sit in heterogeneous environments with UTF-16
protocols (NTFS, CIFS, etc.), talk to Windows/Java platforms, may even
run or be ported onto Windows, or host ported Windows applications.
The world is not always as simple as we would like it to be. :(

> > - 0xffff equals WEOF in most sizeof(wchar_t)=2 implementations.
>
> Any implementation with sizeof(wchar_t)==2 is not ISO C compliant,

Your interpretation of the holy book of ISO C! Systems with wchar_t =
uint16 exist and are widely deployed. People fight wars about holy
books.

If you worry about UTF-16 at all, then I think you should also worry
about these two. Otherwise, there is no point in worrying about
surrogates either.

> The standard says that wchar_t represents a whole
> character and that there are no such things as multi-wchar_t
> characters or shift states for wchar_t.

The wchar_t parts of ISO C that you refer to were written before 1995
(Amendment 1) by people primarily interested in EUC and other ISO
2022-like schemes, and have not been substantially revised since.
UTF-16 (published 1996) was not around when the text you now interpret
was written. You may be stretching the standard beyond its
interpretation capacity if you unify terms like "character" from
different committees and epochs in character-set history.

(Don't misunderstand me, I am not a fan of sizeof(wchar_t)==2; I
merely want to warn of the limits of interpreting things into
standards that the authors could not have been aware of, such as
Unicode's current character and encoding model.)

> I agree that noncharacter code points should not be written to text
> files for interchange with other applications; however, they are valid
> for internal use, and must remain valid for internal use invariant
> under conversions between different encoding forms. Forbidding them
> belongs at the level of the file writer in an application generating
> files for interchange, not at the mbrtowc/wcrtomb level.
> The latter
> could interfere with legitimate internal processing, and just slows
> UTF-8 processing down even more.

Pah, new-fangled, misguided Unicode view of the world! :) Holy books
are best when they are old ...

ISO 10646-1:1993/Am.2 (1996), section R.4, forbids both U+FFFF and
U+FFFE in UTF-8:

  "NOTE 3 - Values of x in the range 0000 D800 .. 0000 DFFF are
  reserved for the UTF-16 form and do not occur in UCS-4. The values
  0000 FFFE and 0000 FFFF also do not occur (see clause 8). The
  mappings of these code positions in UTF-8 are undefined."

http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html

> P.S. Did your old thread about converting invalid bytes to high
> surrogate codepoints (for binary-clean in-band error reporting) when
> decoding UTF-8 ever reach a conclusion? I found part of the thread
> while doing a search, but didn't see where it ended up.

I spent some time investigating schemes that create an isomorphic
mapping between malformed UTF-8 and malformed UTF-16. They all got
horribly complicated and unpleasant. I don't think that there is a
neat and efficient isomorphic mapping. The simple approach is to
define two separate surjective encodings, one to represent malformed
UTF-8 in UTF-16, and the other to represent malformed UTF-16 in UTF-8,
without asking for one to be the inverse of the other.

Markus

-- 
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
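[Editorial note: the clause quoted above names only U+FFFE and U+FFFF;
Unicode later generalized this to 66 noncharacters. A minimal sketch
of the check an interchange-file writer could apply -- the function
name is illustrative, not an existing API:]

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch: test for a Unicode noncharacter code point.  These are
 * U+FDD0..U+FDEF plus the last two code points of each of the 17
 * planes (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF).
 * Per the argument above, a check like this belongs in the file
 * writer generating interchange data, not in mbrtowc/wcrtomb. */
static bool is_noncharacter(uint32_t cp)
{
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return true;
    /* last two code points of any plane, within the Unicode range */
    return cp <= 0x10FFFF && (cp & 0xFFFF) >= 0xFFFE;
}
```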
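[Editorial note: one of the two surjective directions mentioned in the
last paragraph -- representing malformed UTF-8 in UTF-16 -- can be
sketched as below. The names and the particular escape range
U+DC80..U+DCFF are one possible convention chosen for illustration,
not something fixed by the thread:]

```c
#include <stdint.h>

/* Sketch: a UTF-8 decoder that hits an invalid byte b (0x80..0xFF)
 * emits the surrogate value 0xDC00 + b instead of failing; the
 * re-encoder inverts the mapping and reproduces the raw byte, making
 * error reporting binary-clean and in-band.  The opposite direction
 * (malformed UTF-16 in UTF-8) would be a separate encoding, not the
 * inverse of this one. */
static uint32_t escape_invalid_byte(uint8_t b)
{
    return 0xDC00u + b;                 /* maps into U+DC80..U+DCFF */
}

/* Returns 1 and stores the original byte if cp is an escape value,
 * 0 otherwise. */
static int unescape_byte(uint32_t cp, uint8_t *out)
{
    if (cp >= 0xDC80 && cp <= 0xDCFF) {
        *out = (uint8_t)(cp - 0xDC00u);
        return 1;
    }
    return 0;
}
```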