On Sun, Feb 12, 2006 at 05:27:16PM +0000, Markus Kuhn wrote:
> If you worry about UTF-16 at all, then I think you should also worry
> about these two [fffe and ffff]. Otherwise, there is no point in
> worrying about surrogates either.
Actually this got me thinking about whether it's necessary or appropriate to bother with signalling errors for surrogate codepoints and noncharacters at all when decoding UTF-8 in mb[r]towc or other similar interfaces.

The error conditions basically are:

- Overly long representations: these are inherently a security problem when using UTF-8 because they make the round-trip map between UTF-8 and UCS a non-identity map in some cases (e.g. the two-byte sequence C0 AF decoding to 002F, '/').

- Surrogates: these have no security implications as long as the encodings in use are only UTF-8 and UCS character numbers (wchar_t). They only become a problem if someone converts to UTF-16 by applying the identity map to all code points below 0x10000 without checking for illegal surrogates, in which case their presence will make the round trip between UTF-8 and UTF-16 non-identity.

- FFFE: no implications for a UTF-8-and-wchar_t-only system. When converted to UTF-16 or UTF-32, it may cause systems which honor a BOM to misinterpret the text entirely, which may have security implications (e.g. 2F00 gets interpreted as 002F).

- FFFF: may be interpreted as WEOF by broken systems with 16-bit wchar_t. Otherwise a non-issue.

If UTF-8 is going to be the universal character encoding on *nix systems (and hopefully Internet protocols, embedded systems, and all other non-MS systems) for the foreseeable future, it's in the utmost interest of users for performance to be maximized and code size to be minimized. Otherwise there is a strong urge to stick with legacy 8-bit encodings.

Of the above error conditions, only overly long sequences affect a system that uses only UTF-8 and wchar_t, which covers the vast majority of applications. I strongly wonder whether checking for surrogates and illegal noncharacter codepoints should be moved to the UTF-16 encoder (in iconv, or other implementations) and omitted from the UTF-8 decoder. (Rough sketches of both sides of that split are appended at the end of this message.)

The benefits:

- In the naive C implementation with conditional branches for all the error-condition checks, this eliminates two subtractions and two conditional branches per 3-byte sequence (basically all Asian scripts). In very naive implementations, these operations would have been performed for ALL non-ASCII characters.

- In the optimized C implementation with bit twiddling for error conditions, this eliminates 4 subtractions, 2 bitwise ORs, and 1 bit shift per 3-byte sequence. The cache impact of the reduced code should be significant.

- In my heavily optimized x86 implementation, this eliminates 19 bytes of code (~10% of the total function, and closer to 20% if you only count the code that gets executed for BMP characters), comprising 7 instructions with heavy data dependencies between them, per 3-byte sequence. I would estimate about 20 cycles saved on a modern CPU, plus time saved due to lowered cache impact.

Naturally the worth of these gains is very questionable. NOT because computers are "getting faster" -- the idea that you can write slow code because Western Europe and America have fast computers should not be tolerated for a second among people interested in i18n and m17n!! -- but because the gains are _fairly_ small. On the other hand, the practical benefits of signalling surrogates and fffe/ffff as errors in an application which does not deal with UTF-16 are nonexistent.

Markus, Bruno, and others: I'd like to hear your opinions on this matter.

FYI: isomorphism between malformed UTF-8 and invalid wchar_t values is entirely possible without excluding surrogates. Only the ideas for isomorphism to malformed UTF-16 suffer.
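To make the cost concrete, here is a rough sketch of the 3-byte path of an mbrtowc-style UTF-8 decoder, with the checks in question marked. This is only an illustration, not my actual implementation: the function name and layout are made up, and it assumes the caller has already dispatched on a lead byte in the E0-EF range.

#include <stddef.h>

/* Decode one 3-byte UTF-8 sequence starting at s into *pwc.
 * Returns 3 on success, (size_t)-1 on a malformed sequence.
 * Assumes s[0] has already been checked to lie in E0-EF. */
size_t decode3(wchar_t *pwc, const unsigned char *s)
{
    unsigned c;

    /* Both continuation bytes must have the form 10xxxxxx. */
    if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
        return (size_t)-1;

    c = (unsigned)(s[0] & 0x0F) << 12
      | (unsigned)(s[1] & 0x3F) << 6
      | (unsigned)(s[2] & 0x3F);

    /* Overly long representation: must be rejected in any case,
     * since it breaks the UTF-8 <-> UCS identity round trip. */
    if (c < 0x800)
        return (size_t)-1;

    /* The checks in question: surrogates (D800-DFFF) and the
     * FFFE/FFFF noncharacters.  These are the extra subtractions
     * and branches per 3-byte sequence discussed above. */
    if (c - 0xD800 < 0x800 || (c | 1) == 0xFFFF)
        return (size_t)-1;

    *pwc = (wchar_t)c;
    return 3;
}

Dropping that last test is the entire change being proposed on the decoder side.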
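And here is a sketch of where those checks would move: a wchar_t-to-UTF-16 encoding step, as an iconv-style converter might perform. Again the names are made up for illustration and this is not taken from any existing implementation.

#include <stddef.h>
#include <stdint.h>

/* Encode one code point c into out[].  Returns the number of 16-bit
 * units written (1 or 2), or 0 if c cannot appear in well-formed
 * UTF-16 (surrogates, FFFE/FFFF, or values above 10FFFF). */
size_t to_utf16(uint16_t *out, unsigned long c)
{
    /* Surrogates and the FFFE/FFFF noncharacters are rejected here,
     * where UTF-16 actually comes into play, rather than in the
     * UTF-8 decoder. */
    if (c - 0xD800 < 0x800 || (c | 1) == 0xFFFF || c > 0x10FFFF)
        return 0;

    if (c < 0x10000) {
        out[0] = (uint16_t)c;
        return 1;
    }

    /* Supplementary plane: emit a surrogate pair. */
    c -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (c >> 10));
    out[1] = (uint16_t)(0xDC00 | (c & 0x3FF));
    return 2;
}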
Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/