On Wednesday 06 June 2007, Alan Stern wrote:
> It turns out that fs/nls/nls_base.c already includes conversion
> routines such as utf8_wcstombs().  Unfortunately they are not an ideal
> match to what we want for several reasons:
> 
> ...

One of the ongoing headaches of I18N work is backwards compat
with old bugs.  Even ones with security ramifications... and
basic conceptual goofs, like assuming that a "character" will
fit into a single "char" (or more recently, 16-bit "wchar_t").


> A few other source files, such as drivers/s390/char/keyboard.c, include
> their own home-brewed UTF-8 conversions.  In some cases the converted
> characters aren't stored in a buffer; they are passed to various inline
> routines.  In no cases are surrogate pairs handled correctly.

Correct handling of surrogates has always been a problem.
Just like having correct utilities to build on has been...

It's stuff like this which makes companies create job titles
like "Internationalization Guru", and give them enough clout
to go fix all the little fires burning all over the place.

I noticed a few years ago that nobody in Linux had declared
themselves the kernel I18N guru and done so.  It's sort of
understandable -- it's a *huge* area, less thankful than
most, and folk with the relevant knowledge tend not to have
any street cred among kernel folk -- but I'm surprised that
even the basics like UTF-{8,16LE} are as broken as you report.
That's the _easy_ stuff!


> In fact, as far as I can see no part of the kernel is prepared to
> handle Unicode values higher than U+ffff.

The idea is that they shouldn't need to know that's what
they're doing ... because proper string handling ensures
any characters in the "astral planes" are transparently
passed through.  (Such policies are what make Unicode more
than just a character set.)

That relies on correct transcoding though.  And some of
that code you pointed to clearly dates from very early
implementations, before people became very aware of the
security (including integrity) consequences of some of
those shortcuts ... like assuming for example that no
characters need more than 16 bits.

The problem with characters using the astral planes is
right now mostly that they're an attack vector because
of buggy implementations ... there aren't many real
users of those characters.

Rejecting surrogates would be a **MUCH** better strategy
than mis-handling them.
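
To make "rejecting" concrete, here is a minimal sketch of a strict
decoder for one code point.  It assumes the UTF-16 code units have
already been converted to CPU byte order, and the function name
utf16_decode_strict() is made up for illustration; it is not an
existing kernel routine.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode one code point from n UTF-16 code units
 * that are already in CPU byte order.  Returns the number of units
 * consumed (1 or 2), or -1 for empty input or an unpaired surrogate.
 */
static int utf16_decode_strict(const uint16_t *s, size_t n, uint32_t *cp)
{
	if (n == 0)
		return -1;

	/* Anything outside D800..DFFF is a complete BMP character. */
	if (s[0] < 0xD800 || s[0] > 0xDFFF) {
		*cp = s[0];
		return 1;
	}

	/* Low surrogate first, or high surrogate with nothing after it. */
	if (s[0] >= 0xDC00 || n < 2)
		return -1;

	/* High surrogate must be followed by a low surrogate. */
	if (s[1] < 0xDC00 || s[1] > 0xDFFF)
		return -1;

	*cp = 0x10000 + (((uint32_t)(s[0] - 0xD800) << 10) |
			 (uint32_t)(s[1] - 0xDC00));
	return 2;
}
```

A caller that gets -1 can fail the whole string rather than pass a
half-decoded value along.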


> So the situation is a mess.  I'm not even sure what features a library 
> ought to provide.  Something from the following selection:
> 
>       Store output in a buffer or send it to an output routine.

My answer would be to adopt C strings in UTF-8 as the standard
representation.  Output logic could translate a few characters
at a time and send them along, if for some reason it's got to
take UTF-16 (or other Unicode encodings) as input.
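
One way to picture "translate a few characters at a time and send
them along" is the sketch below.  Everything in it is hypothetical:
the emit_fn hook, utf8_encode(), utf8_stream() and the membuf sink
are illustrative names, and the input is assumed to be already
decoded code points.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical output hook: receives a few bytes at a time. */
typedef void (*emit_fn)(void *ctx, const char *buf, size_t len);

/*
 * Encode one code point as UTF-8.  Returns the byte count (1..4),
 * or 0 if cp is a surrogate or lies beyond U+10FFFF.
 */
static size_t utf8_encode(uint32_t cp, char out[4])
{
	if (cp < 0x80) {
		out[0] = (char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (char)(0xC0 | (cp >> 6));
		out[1] = (char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;
	if (cp < 0x10000) {
		out[0] = (char)(0xE0 | (cp >> 12));
		out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp > 0x10FFFF)
		return 0;
	out[0] = (char)(0xF0 | (cp >> 18));
	out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (char)(0x80 | (cp & 0x3F));
	return 4;
}

/*
 * Encode code points one at a time and hand each result to emit().
 * Returns 0 on success, -1 at the first invalid code point.
 */
static int utf8_stream(const uint32_t *cps, size_t n,
		       emit_fn emit, void *ctx)
{
	char buf[4];
	size_t i, len;

	for (i = 0; i < n; i++) {
		len = utf8_encode(cps[i], buf);
		if (len == 0)
			return -1;
		emit(ctx, buf, len);
	}
	return 0;
}

/* Example sink: append to a small fixed buffer, dropping overflow. */
struct membuf {
	char data[32];
	size_t used;
};

static void membuf_emit(void *ctx, const char *buf, size_t len)
{
	struct membuf *m = ctx;

	if (len <= sizeof(m->data) - m->used) {
		memcpy(m->data + m->used, buf, len);
		m->used += len;
	}
}
```

The same utf8_stream() call serves both cases from the question:
pass membuf_emit to fill a buffer, or pass a different hook to feed
an output routine directly.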

>       Native byte-order, little-endian, or big-endian.

Native order is IMO not much use, unless you're trying to
use Unicode native; it's best as an external representation.
At least in kernel terms.
 
>       Return an error for invalid codes or ignore them.

Error is safest...  but remember, even with Unicode it's
going to be "invalid strings".

But also:  if you're given a filesystem with names stored in
Unicode, unpaired surrogates are errors I'd expect FSCK would
handle ... but preventing access to those files would likely
be a Bad Thing.

>       Stop at NUL or take a length argument.

Both options are needed.

> Any suggestions for the best way to organize all this?

As I suggested before:  core transcoding routines which are
actually correct, and have options for those features above.
Then provide simple wrappers for primary option combinations.
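
As a rough sketch of that shape, assuming UTF-16 in and UTF-8 out:
one core routine takes option flags for byte order, NUL handling
and error policy, and thin wrappers pin down the common
combinations.  All names here (the XC_* flags, utf16_to_utf8_core()
and both wrappers) are invented for illustration and do not exist
in the kernel.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical option flags; none of these names exist in the kernel. */
#define XC_BIG_ENDIAN	0x1	/* input is UTF-16BE (default: UTF-16LE) */
#define XC_STOP_AT_NUL	0x2	/* stop at U+0000 even if input remains */
#define XC_REPLACE_BAD	0x4	/* unpaired surrogate becomes U+FFFD */

static uint32_t get16(const uint8_t *p, unsigned flags)
{
	if (flags & XC_BIG_ENDIAN)
		return ((uint32_t)p[0] << 8) | p[1];
	return ((uint32_t)p[1] << 8) | p[0];
}

/*
 * Core: convert len bytes of UTF-16 to UTF-8.  Returns the number of
 * bytes written (no NUL is appended), or -1 on odd length, invalid
 * input in strict mode, or insufficient output space.
 */
static long utf16_to_utf8_core(const uint8_t *in, size_t len,
			       char *out, size_t outlen, unsigned flags)
{
	size_t i = 0, o = 0, n;
	char b[4];

	if (len & 1)
		return -1;
	while (i < len) {
		uint32_t cp = get16(in + i, flags);

		i += 2;
		if (cp == 0 && (flags & XC_STOP_AT_NUL))
			break;
		/* Combine a high surrogate with a following low one. */
		if (cp >= 0xD800 && cp <= 0xDBFF && i < len) {
			uint32_t lo = get16(in + i, flags);

			if (lo >= 0xDC00 && lo <= 0xDFFF) {
				cp = 0x10000 + ((cp - 0xD800) << 10) +
				     (lo - 0xDC00);
				i += 2;
			}
		}
		/* Still in the surrogate range: it was unpaired. */
		if (cp >= 0xD800 && cp <= 0xDFFF) {
			if (!(flags & XC_REPLACE_BAD))
				return -1;
			cp = 0xFFFD;
		}
		if (cp < 0x80) {
			b[0] = (char)cp;
			n = 1;
		} else if (cp < 0x800) {
			b[0] = (char)(0xC0 | (cp >> 6));
			b[1] = (char)(0x80 | (cp & 0x3F));
			n = 2;
		} else if (cp < 0x10000) {
			b[0] = (char)(0xE0 | (cp >> 12));
			b[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
			b[2] = (char)(0x80 | (cp & 0x3F));
			n = 3;
		} else {
			b[0] = (char)(0xF0 | (cp >> 18));
			b[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
			b[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
			b[3] = (char)(0x80 | (cp & 0x3F));
			n = 4;
		}
		if (n > outlen - o)
			return -1;
		memcpy(out + o, b, n);
		o += n;
	}
	return (long)o;
}

/* Wrapper: strict, little-endian, length-delimited. */
static long utf16le_to_utf8(const uint8_t *in, size_t len,
			    char *out, size_t outlen)
{
	return utf16_to_utf8_core(in, len, out, outlen, 0);
}

/* Wrapper: lenient, little-endian, stops at the first NUL. */
static long utf16le_z_to_utf8_lossy(const uint8_t *in, size_t len,
				    char *out, size_t outlen)
{
	return utf16_to_utf8_core(in, len, out, outlen,
				  XC_STOP_AT_NUL | XC_REPLACE_BAD);
}
```

The lenient wrapper is the sort of thing a filesystem might want
for names that are already on disk, while the strict one suits
anything that creates new strings.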

- Dave


_______________________________________________
linux-usb-devel@lists.sourceforge.net
To unsubscribe, use the last form field at:
https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
