On Sun, Feb 12, 2006 at 05:27:16PM +0000, Markus Kuhn wrote:
> If you worry about UTF-16 at all, then I think you should also worry
> about these two [fffe and ffff]. Otherwise, there is no point in
> worrying about surrogates either.
Actually this got me thinking about whether it's necessary or appropriate to bother with signalling errors for surrogate codepoints and noncharacters at all when decoding UTF-8 in mb[r]towc or other similar interfaces.

The error conditions basically are:

- Overly long representations: these are inherently a security problem when using UTF-8 because they make the round-trip map between UTF-8 and UCS a non-identity map in some cases (e.g. the two-byte sequence C0 AF decoding to 002F, '/').

- Surrogates: these have no security implications as long as the encodings in use are only UTF-8 and UCS character numbers (wchar_t). They only become a problem if someone converts to UTF-16 by applying the identity map to all code points below 0x10000 without checking for illegal surrogates, in which case their presence will make the round trip between UTF-8 and UTF-16 non-identity.

- FFFE: no implications for a UTF-8-and-wchar_t-only system. When converted to UTF-16 or UTF-32, it may cause systems which honor a BOM to misinterpret the text entirely, which may have security implications (e.g. 2F00 gets interpreted as 002F).

- FFFF: may be interpreted as WEOF by broken systems with 16-bit wchar_t. Otherwise a non-issue.

If UTF-8 is going to be the universal character encoding on *nix systems (and hopefully Internet protocols, embedded systems, and all other non-MS systems) for the foreseeable future, it's in the utmost interest of users for performance to be maximized and code size to be minimized. Otherwise there is a strong urge to stick with legacy 8-bit encodings.

Of the above error conditions, only overly long sequences affect a system that uses only UTF-8 and wchar_t, which covers the vast majority of applications. I strongly wonder whether checking for surrogates and illegal noncharacter codepoints should be moved to the UTF-16 encoder (in iconv, or other implementations) and omitted from the UTF-8 decoder. (Rough sketches of both sides of that split are appended at the end of this message.)

The benefits:

- In the naive C implementation with conditional branches for all the error-condition checks, this eliminates two subtractions and two conditional branches per 3-byte sequence (basically all Asian scripts). In very naive implementations, these operations would have been performed for ALL non-ASCII characters.

- In the optimized C implementation with bit twiddling for error conditions, this eliminates 4 subtractions, 2 bitwise ORs, and 1 bit shift per 3-byte sequence. The cache impact of the reduced code should be significant.

- In my heavily optimized x86 implementation, this eliminates 19 bytes of code (~10% of the total function, and closer to 20% if you only count the code that gets executed for BMP characters), comprising 7 instructions with heavy data dependencies between them, per 3-byte sequence. I would estimate about 20 cycles saved on a modern CPU, plus time saved due to lowered cache impact.

Naturally the worth of these gains is very questionable. NOT because computers are "getting faster" -- the idea that you can write slow code because Western Europe and America have fast computers should not be tolerated for a second among people interested in i18n and m17n!! -- but because the gains are _fairly_ small. On the other hand, the practical benefits of signalling surrogates and fffe/ffff as errors in an application which does not deal with UTF-16 are nonexistent.

Markus, Bruno, and others: I'd like to hear your opinions on this matter.

FYI: isomorphism between malformed UTF-8 and invalid wchar_t values is entirely possible without excluding surrogates. Only the ideas for isomorphism to malformed UTF-16 suffer.
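To make the cost concrete, here is a rough sketch of the 3-byte path of an mbrtowc-style UTF-8 decoder, with the checks in question marked. This is only an illustration, not my actual implementation: the function name and layout are made up, and it assumes the caller has already dispatched on a lead byte in the E0-EF range.

#include <stddef.h>

/* Decode one 3-byte UTF-8 sequence starting at s into *pwc.
 * Returns 3 on success, (size_t)-1 on a malformed sequence.
 * Assumes s[0] has already been checked to lie in E0-EF. */
size_t decode3(wchar_t *pwc, const unsigned char *s)
{
    unsigned c;

    /* Both continuation bytes must have the form 10xxxxxx. */
    if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
        return (size_t)-1;

    c = (unsigned)(s[0] & 0x0F) << 12
      | (unsigned)(s[1] & 0x3F) << 6
      | (unsigned)(s[2] & 0x3F);

    /* Overly long representation: must be rejected in any case,
     * since it breaks the UTF-8 <-> UCS identity round trip. */
    if (c < 0x800)
        return (size_t)-1;

    /* The checks in question: surrogates (D800-DFFF) and the
     * FFFE/FFFF noncharacters.  These are the extra subtractions
     * and branches per 3-byte sequence discussed above. */
    if (c - 0xD800 < 0x800 || (c | 1) == 0xFFFF)
        return (size_t)-1;

    *pwc = (wchar_t)c;
    return 3;
}

Dropping that last test is the entire change being proposed on the decoder side.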
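And here is a sketch of where those checks would move: a wchar_t-to-UTF-16 encoding step, as an iconv-style converter might perform. Again the names are made up for illustration and this is not taken from any existing implementation.

#include <stddef.h>
#include <stdint.h>

/* Encode one code point c into out[].  Returns the number of 16-bit
 * units written (1 or 2), or 0 if c cannot appear in well-formed
 * UTF-16 (surrogates, FFFE/FFFF, or values above 10FFFF). */
size_t to_utf16(uint16_t *out, unsigned long c)
{
    /* Surrogates and the FFFE/FFFF noncharacters are rejected here,
     * where UTF-16 actually comes into play, rather than in the
     * UTF-8 decoder. */
    if (c - 0xD800 < 0x800 || (c | 1) == 0xFFFF || c > 0x10FFFF)
        return 0;

    if (c < 0x10000) {
        out[0] = (uint16_t)c;
        return 1;
    }

    /* Supplementary plane: emit a surrogate pair. */
    c -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (c >> 10));
    out[1] = (uint16_t)(0xDC00 | (c & 0x3FF));
    return 2;
}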
Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/