On Tue, 5 Jun 2007, David Brownell wrote:

> > Date: Tue, 5 Jun 2007 17:00:56 -0400 (EDT)
> > From: Alan Stern <[EMAIL PROTECTED]>
> >
> > > > > Does anybody think it would be worthwhile to convert string
> > > > > descriptors from UCS-16 to UTF-8 (instead of Latin1) when we read
> > > > > them in?
> > > 
> > > Or even UTF-7 ... ?   FWIW the input isn't UCS-16; it's UTF16-LE.
> >
> > Do you happen to know where "UCS-16" is defined?
> 
> Not right off the bat, but see the Unicode 5.0 spec and
> 
>    http://www.ietf.org/rfc/rfc3629.txt
> 
> To a first approximation, "UCS-16 ~= Unicode without surrogates".
> Unicode is at some level a subset of UCS-32 ... UCS-16 plus the
> characters that can be accessed using surrogate pairs.
> 
> A UTF is a "UCS Transformation Format" ...  an encoding (bit pattern,
> endianness matters) of what a "Universal Character Set" (UCS) represents
> (logical "characters" associated with glyphs, associated with numbers).
> 
> So for example UTF-8 was originally defined as mapping UCS-32 into a
> sequence of 8-bit bytes, although nowadays it's more often treated as
> support for Unicode, not UCS-32.  (See the predecessor of the above RFC
> for the five- and six-byte UTF-8 encodings it defined.)

Thanks for the explanation.  I see from the RFC that the correct names 
are UCS-2 and UCS-4, not UCS-16 and UCS-32.
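
Just to make the "transformation" part concrete, here's roughly what the
code-point-to-UTF-8 step looks like on its own.  This is a throwaway
userspace sketch, not the patch; the function name is made up and there
are no buffer-length or surrogate checks:

#include <stdio.h>

/* Encode one UCS code point as UTF-8.  Returns the number of bytes
 * written (1-4), or 0 if the value is beyond U+10FFFF.
 */
static int codepoint_to_utf8(unsigned long cp, unsigned char *out)
{
	if (cp < 0x80) {			/* U+0000..U+007F */
		out[0] = cp;
		return 1;
	} else if (cp < 0x800) {		/* U+0080..U+07FF */
		out[0] = 0xc0 | (cp >> 6);
		out[1] = 0x80 | (cp & 0x3f);
		return 2;
	} else if (cp < 0x10000) {		/* U+0800..U+FFFF */
		out[0] = 0xe0 | (cp >> 12);
		out[1] = 0x80 | ((cp >> 6) & 0x3f);
		out[2] = 0x80 | (cp & 0x3f);
		return 3;
	} else if (cp < 0x110000) {		/* U+10000..U+10FFFF */
		out[0] = 0xf0 | (cp >> 18);
		out[1] = 0x80 | ((cp >> 12) & 0x3f);
		out[2] = 0x80 | ((cp >> 6) & 0x3f);
		out[3] = 0x80 | (cp & 0x3f);
		return 4;
	}
	return 0;
}

int main(void)
{
	unsigned char buf[4];
	int i, n;

	n = codepoint_to_utf8(0x20ac, buf);	/* U+20AC encodes as e2 82 ac */
	for (i = 0; i < n; i++)
		printf("%02x ", buf[i]);
	printf("\n");
	return 0;
}

The old five- and six-byte forms you mention extended the same pattern
far enough to cover 31-bit UCS-32 values.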

It's a shame that so few parts of the Unicode standard are freely
available.

> > Here's a patch.  Anybody see anything wrong with it?  I don't have any 
> > devices with non-ASCII characters in the default language descriptors 
> > for testing.  It would be nice if there was a library in the kernel to 
> > do these sorts of conversions, but there doesn't appear to be.
> 
> You could _start_ such a library, in lib/utf8.c or somesuch ...

I'll look into it.  The actual conversion code was taken from elsewhere 
in the kernel, so starting a library isn't a bad idea.
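
If it does turn into lib/utf8.c, the interface could start out very
small -- something like this header sketch, with hypothetical names and
an errno convention I'm making up for illustration:

/* include/linux/utf8.h -- hypothetical starting point */
#ifndef _LINUX_UTF8_H
#define _LINUX_UTF8_H

#include <linux/types.h>

/* Returns the number of bytes stored in dst, or a negative errno for
 * malformed input; other converters could be added alongside later.
 */
int utf16le_to_utf8(u8 *dst, size_t dst_len, const u8 *src, size_t src_len);

#endif /* _LINUX_UTF8_H */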

> > + */
> > +static int utf16le_to_utf8(u8 *dst, size_t dst_len, u8 *src, size_t src_len)
> 
> I'd have expected "__le16 *src", maybe with get_unaligned()...
> otherwise it'll be impossible to catch errors like wrongly
> passing in some "__be16 *" data.

Or some other type-safe approach.  For a library it's a little awkward 
to have duplicate little-endian and big-endian versions of everything.

I did it this way here because that's how usb_string() passes its 
arguments.  Changing to "__le16 *src" would require a cast in the 
caller, which kind of defeats the purpose.  But at least it is 
explicit.
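
For what it's worth, the type-safe version of just the 16-bit read is
tiny -- a sketch only, with a made-up helper name:

#include <linux/types.h>
#include <asm/byteorder.h>
#include <asm/unaligned.h>

/* Read one UTF-16LE code unit from a possibly unaligned buffer.
 * Taking "const __le16 *" lets sparse catch callers that pass
 * big-endian or cpu-endian data by mistake.
 */
static inline u16 utf16le_get_unit(const __le16 *src)
{
	return le16_to_cpu(get_unaligned(src));
}

A big-endian variant for a shared library would only have to differ in
this one helper, which would limit the duplication.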

> > +{
> > +   unsigned c;
> > +   u8 *d, *e1, *e2, *e3;
> > +
> > +   e1 = dst + dst_len - 1;
> > +   e2 = e1 - 1;
> > +   e3 = e2 - 1;
> > +   for (d = dst; src_len > 0; (--src_len, src += 2)) {
> > +           c = src[0] | (src[1] << 8);
> > +           if (c < 0x80) {
> > +                   /*  0******* */
> 
> It'd be clearer if you included the Unicode values in the
> comments, not just the output bytes.  So:  one byte UTF-8
> code points are less than U+0080 ...

Okay.

> > +                   if (d > e1)
> > +                           break;
> > +                   d[0] = c;
> > +                   d += 1;
> > +           } else if (c < 0x800) {
> > +                   /* 110***** 10****** */
> 
> ... two byte UTF-8 code points are less than U+0800 ...
> 
> > +                   if (d > e2)
> > +                           break;
> > +                   d[0] = 0xc0 | (c >> 6);
> > +                   d[1] = 0x80 | (c & 0x3f);
> > +                   d += 2;
> > +           } else {
> > +                   /* 1110**** 10****** 10****** */
> 
> ... three-byte UTF-8 code points are everything else ...
> 
> ... EXCEPT (!!) for surrogate characters which require the
> four-byte encodings.  If you don't handle these correctly,
> you need to fail instead of generating the wrong encoding.
> 
> Conceptually, what you need to do is map the UTF-16 surrogate
> pairs into their full UCS code points, and then map those into
> UTF-8.  UTF-16 inputs can have errors whereby the surrogate
> codes are not correctly paired ... you should have an error
> exit to report bogus UTF-16 format data.

I'll add it.  It's worth noting that the other existing routines from 
which I borrowed this algorithm don't handle surrogate pairs.
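
For the record, combining a surrogate pair back into its code point is
only a few lines.  This is a sketch; the helper name and the error
convention are made up:

#include <stdint.h>

/* Map a UTF-16 high/low surrogate pair to its code point, or return
 * (uint32_t)-1 if the two units don't form a valid pair, so malformed
 * input can be rejected instead of silently mis-encoded.
 */
static uint32_t utf16_decode_pair(uint16_t hi, uint16_t lo)
{
	if (hi < 0xd800 || hi > 0xdbff)		/* not a high (leading) surrogate */
		return (uint32_t)-1;
	if (lo < 0xdc00 || lo > 0xdfff)		/* not a low (trailing) surrogate */
		return (uint32_t)-1;
	return 0x10000 + (((uint32_t)(hi - 0xd800) << 10) | (lo - 0xdc00));
}

A high surrogate at the very end of the input, or a stray low surrogate
with no preceding high one, would likewise be reported as an error; the
combined value (U+10000..U+10FFFF) then takes the four-byte UTF-8 form.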

> I've seen descriptions of security bugs that hide inside such
> transcoding bugs (mis-handling surrogates, etc).  So if this
> weren't inside the kernel, that might not be a big deal ...
> 
> See the security issues in the RFC above for some hints about
> some of the relevant attacks.

> Another security risk:  emitting a NUL before the end of the
> string.  Treat UTF-16 inputs like { 'a', 'b', 0, 'c', 'd' }
> as errors, don't just return "ab\0cd".

It's hard to know what to make of this one.  In some sense it isn't a
fault of the conversion routine, because it is documented as returning
a byte array and a length -- not a NUL-terminated string.  However the
caller is liable to use it as a string (like usb_string() does), so
your suggestion seems to be the safest course.
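
A pre-scan along these lines would do it -- again a sketch, with an
invented name, and assuming src_len counts 16-bit code units the way
the loop in the patch suggests:

#include <stddef.h>

/* Reject UTF-16LE input containing an embedded NUL code unit, since
 * callers such as usb_string() treat the UTF-8 result as a C string.
 */
static int utf16le_check_nul(const unsigned char *src, size_t src_len)
{
	for (; src_len > 0; --src_len, src += 2) {
		if (src[0] == 0 && src[1] == 0)
			return -1;	/* { 'a', 'b', 0, 'c', 'd' } style input */
	}
	return 0;
}

(Equivalently, a "c == 0" test at the top of the existing loop body
would avoid the extra pass.)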

Another potential error is truncation because the destination buffer is 
too short.  Since snprintf() doesn't treat truncation as an error, I 
guess this routine shouldn't either.

Alan Stern

