> Date: Tue, 5 Jun 2007 17:00:56 -0400 (EDT)
> From: Alan Stern <[EMAIL PROTECTED]>
>
> > > > Does anybody think it would be worthwhile to convert string descriptors 
> > > > from UCS-16 to UTF-8 (instead of Latin1) when we read them in?
> > 
> > Or even UTF-7 ... ?   FWIW the input isn't UCS-16; it's UTF16-LE.
>
> Do you happen to know where "UCS-16" is defined?

Not right off the bat, but see the Unicode 5.0 spec and

   http://www.ietf.org/rfc/rfc3629.txt

To a first approximation, "UCS-16 ~= Unicode without surrogates".
Unicode is at some level a subset of UCS-32 ... UCS-16 plus the
characters that can be accessed using surrogate pairs.

A UTF is a "UCS Transformation Format" ...  an encoding (bit pattern,
endianness matters) of what a "Universal Character Set" (UCS) represents
(logical "characters" associated with glyphs, associated with numbers).

So for example UTF-8 was originally defined as mapping UCS-32 into a
sequence of 8-bit bytes, although nowadays it's more often treated as
support for Unicode, not UTF-32.  (See the predecessor of the above RFC
for the five- and six-byte UTF-8 encodings it defined.)
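
As a rough illustration of those ranges, here is a small userspace
sketch of how many bytes UTF-8 (as restricted by RFC 3629) needs for a
given code point.  The name utf8_len() is made up for this example; it
is not an existing kernel helper.

```c
#include <stdint.h>

/*
 * Number of bytes RFC 3629 UTF-8 uses for code point cp, or 0 if cp
 * cannot be encoded (a surrogate value, or above U+10FFFF).
 * utf8_len() is a hypothetical name used only for this sketch.
 */
static int utf8_len(uint32_t cp)
{
	if (cp < 0x80)
		return 1;	/* U+0000..U+007F */
	if (cp < 0x800)
		return 2;	/* U+0080..U+07FF */
	if (cp >= 0xd800 && cp <= 0xdfff)
		return 0;	/* surrogates are not characters */
	if (cp < 0x10000)
		return 3;	/* U+0800..U+FFFF */
	if (cp < 0x110000)
		return 4;	/* U+10000..U+10FFFF */
	return 0;		/* past Unicode; RFC 2279 used 5-6 bytes here */
}
```

Anything that returns 0 here is a value a converter should reject
rather than encode.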



> > > I remember that issue. I thought that we wanted some kind of escape
> > > syntax... Like what HTML uses with &#xxxx; perhaps. This would allow
> > > to edit xorg.conf on systems which are not UTF-8 clean. But perhaps
> > > it's a non-goal. How big is the code to convert (we need both ways,
> > > right)?
> > 
> > How big?  Not big.  UTF-16 to UTF-8 is a simple algorithm.  For
> > the reverse, see drivers/usb/gadget/usbstring.c ... the trick is
> > you'd need to know enough Unicode to not goof it.  Or, to find
> > some code that does it right.
>
> Here's a patch.  Anybody see anything wrong with it?  I don't have any 
> devices with non-ASCII characters in the default language descriptors 
> for testing.  It would be nice if there was a library in the kernel to 
> do these sorts of conversions, but there doesn't appear to be.

You could _start_ such a library, in lib/utf8.c or somesuch ...


> Nicolas, does it make your life any easier?
>
> Alan Stern
>
>
> Index: usb-2.6/drivers/usb/core/message.c
> ===================================================================
> --- usb-2.6.orig/drivers/usb/core/message.c
> +++ usb-2.6/drivers/usb/core/message.c
> @@ -731,24 +731,71 @@ static int usb_string_sub(struct usb_dev
>  }
>  
>  /**
> - * usb_string - returns ISO 8859-1 version of a string descriptor
> + * utf16le_to_utf8 - convert a string encoded in UTF-16LE to UTF-8
> + * @dst: the UTF-8 output buffer
> + * @dst_len: number of bytes available in @dst
> + * @src: the UTF-16LE input buffer
> + * @src_len: number of two-byte characters in @src
> + *
> + * Stores as many completely converted characters from @src as will fit
> + * in @dst (i.e., no partial character will remain at the end of @dst).
> + * No terminating NULL is appended to @dst.
> + *
> + * Returns the number of bytes stored in @dst.

Or negative error code ... you'll need error exits, see below.


> + */
> +static int utf16le_to_utf8(u8 *dst, size_t dst_len, u8 *src, size_t src_len)

I'd have expected "__le16 *src", maybe with get_unaligned()...
otherwise it'll be impossible to catch errors like wrongly
passing in some "__be16 *" data.
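
To show the idea outside the kernel: a distinct type per byte order
lets the compiler reject mixed-up buffers.  In-kernel the check would
come from sparse and the __le16 annotation, with something like
le16_to_cpu(get_unaligned(src)) doing the load.  The le16_t type and
le16_load() below are hypothetical stand-ins for a userspace sketch,
not kernel API.

```c
#include <stdint.h>

/*
 * Hypothetical userspace stand-in for the kernel's __le16.  Wrapping
 * the two bytes in a struct makes it a distinct type, so passing a
 * pointer to some other (e.g. big-endian) wrapper fails to compile.
 * A struct of bytes has alignment 1, so unaligned buffers are fine.
 */
typedef struct {
	uint8_t b[2];
} le16_t;

/* Load one little-endian 16-bit value, byte by byte. */
static uint16_t le16_load(const le16_t *p)
{
	return (uint16_t)(p->b[0] | (p->b[1] << 8));
}
```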


> +{
> +     unsigned c;
> +     u8 *d, *e1, *e2, *e3;
> +
> +     e1 = dst + dst_len - 1;
> +     e2 = e1 - 1;
> +     e3 = e2 - 1;
> +     for (d = dst; src_len > 0; (--src_len, src += 2)) {
> +             c = src[0] | (src[1] << 8);
> +             if (c < 0x80) {
> +                     /*  0******* */

It'd be clearer if you included the Unicode values in the
comments, not just the output bytes.  So:  one byte UTF-8
code points are less than U+0080 ...

> +                     if (d > e1)
> +                             break;
> +                     d[0] = c;
> +                     d += 1;
> +             } else if (c < 0x800) {
> +                     /* 110***** 10****** */

... two byte UTF-8 code points are less than U+0800 ...

> +                     if (d > e2)
> +                             break;
> +                     d[0] = 0xc0 | (c >> 6);
> +                     d[1] = 0x80 | (c & 0x3f);
> +                     d += 2;
> +             } else {
> +                     /* 1110**** 10****** 10****** */

... three-byte UTF-8 code points are everything else ...

... EXCEPT (!!) for surrogate characters which require the
four-byte encodings.  If you don't handle these correctly,
you need to fail instead of generating the wrong encoding.

Conceptually, what you need to do is map the UTF-16 surrogate
pairs into their full UCS code points, and then map those into
UTF-8.  UTF-16 inputs can have errors whereby the surrogate
codes are not correctly paired ... you should have an error
exit to report bogus UTF-16 format data.
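
Roughly what I mean, as an untested-in-kernel userspace sketch rather
than a replacement for the patch: combine each surrogate pair into one
code point, emit the four-byte form for it, and bail out on unpaired
surrogates.  The function name, the use of stdint types instead of u8,
and the choice of -EILSEQ as the error code are all assumptions made
for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Convert src_len UTF-16LE code units at src into UTF-8 at dst.
 * Returns the number of bytes stored (stopping early, on a character
 * boundary, if dst fills up), or -EILSEQ on an unpaired surrogate.
 */
static int utf16le_to_utf8_sketch(uint8_t *dst, size_t dst_len,
				  const uint8_t *src, size_t src_len)
{
	uint8_t *d = dst;
	uint8_t *end = dst + dst_len;

	while (src_len > 0) {
		uint32_t c = src[0] | (src[1] << 8);
		size_t units = 1;

		if (c >= 0xdc00 && c <= 0xdfff)
			return -EILSEQ;	/* trail surrogate with no lead */
		if (c >= 0xd800 && c <= 0xdbff) {
			uint32_t lo;

			if (src_len < 2)
				return -EILSEQ;	/* input ends mid-pair */
			lo = src[2] | (src[3] << 8);
			if (lo < 0xdc00 || lo > 0xdfff)
				return -EILSEQ;	/* lead without a trail */
			/* pair maps to U+10000..U+10FFFF */
			c = 0x10000 + ((c - 0xd800) << 10) + (lo - 0xdc00);
			units = 2;
		}

		if (c < 0x80) {
			/* U+0000..U+007F: 0xxxxxxx */
			if (end - d < 1)
				break;
			*d++ = c;
		} else if (c < 0x800) {
			/* U+0080..U+07FF: 110xxxxx 10xxxxxx */
			if (end - d < 2)
				break;
			*d++ = 0xc0 | (c >> 6);
			*d++ = 0x80 | (c & 0x3f);
		} else if (c < 0x10000) {
			/* U+0800..U+FFFF: 1110xxxx 10xxxxxx 10xxxxxx */
			if (end - d < 3)
				break;
			*d++ = 0xe0 | (c >> 12);
			*d++ = 0x80 | ((c >> 6) & 0x3f);
			*d++ = 0x80 | (c & 0x3f);
		} else {
			/* U+10000..U+10FFFF: 11110xxx 10xxxxxx (x3) */
			if (end - d < 4)
				break;
			*d++ = 0xf0 | (c >> 18);
			*d++ = 0x80 | ((c >> 12) & 0x3f);
			*d++ = 0x80 | ((c >> 6) & 0x3f);
			*d++ = 0x80 | (c & 0x3f);
		}
		src += 2 * units;
		src_len -= units;
	}
	return (int)(d - dst);
}
```

With this shape the caller has to check for a negative return before
using the result as a buffer index.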


I've seen descriptions of security bugs that hide inside such
transcoding bugs (mis-handling surrogates, etc).  So if this
weren't inside the kernel, that might not be a big deal ...

See the security issues in the RFC above for some hints about
some of the relevant attacks.

- Dave


> +                     if (d > e3)
> +                             break;
> +                     d[0] = 0xe0 | (c >> 12);
> +                     d[1] = 0x80 | ((c >> 6) & 0x3f);
> +                     d[2] = 0x80 | (c & 0x3f);
> +                     d += 3;
> +             }
> +     }
> +     return d - dst;
> +}
> +
> +/**
> + * usb_string - returns UTF-8 version of a string descriptor
>   * @dev: the device whose string descriptor is being retrieved
>   * @index: the number of the descriptor
>   * @buf: where to put the string
>   * @size: how big is "buf"?
>   * Context: !in_interrupt ()
>   * 
> - * This converts the UTF-16LE encoded strings returned by devices, from
> - * usb_get_string_descriptor(), to null-terminated ISO-8859-1 encoded ones
> - * that are more usable in most kernel contexts.  Note that all characters
> - * in the chosen descriptor that can't be encoded using ISO-8859-1
> - * are converted to the question mark ("?") character, and this function
> - * chooses strings in the first language supported by the device.
> + * This retrieves a UTF-16LE encoded string from a device and converts
> + * it to a NULL-terminated UTF-8 encoded string as used by the rest of
> + * the kernel.  Note that this function chooses strings in the first
> + * language supported by the device.
>   *
>   * The ASCII (or, redundantly, "US-ASCII") character set is the seven-bit
> - * subset of ISO 8859-1. ISO-8859-1 is the eight-bit subset of Unicode,
> - * and is appropriate for use many uses of English and several other
> - * Western European languages.  (But it doesn't include the "Euro" symbol.)
> + * subset of UTF-8.  Strings containing only ASCII characters appear exactly
> + * the same when encoded in UTF-8.  Characters (or "code-points") with
> + * values above 127 are encoded using multiple bytes.
>   *
>   * This call is synchronous, and may not be used in an interrupt context.
>   *
> @@ -758,7 +805,6 @@ int usb_string(struct usb_device *dev, i
>  {
>       unsigned char *tbuf;
>       int err;
> -     unsigned int u, idx;
>  
>       if (dev->state == USB_STATE_SUSPENDED)
>               return -EHOSTUNREACH;
> @@ -794,20 +840,12 @@ int usb_string(struct usb_device *dev, i
>       if (err < 0)
>               goto errout;
>  
> -     size--;         /* leave room for trailing NULL char in output buffer */
> -     for (idx = 0, u = 2; u < err; u += 2) {
> -             if (idx >= size)
> -                     break;
> -             if (tbuf[u+1])                  /* high byte */
> -                     buf[idx++] = '?';  /* non ISO-8859-1 character */
> -             else
> -                     buf[idx++] = tbuf[u];
> -     }
> -     buf[idx] = 0;
> -     err = idx;
> +     err = utf16le_to_utf8(buf, size - 1, &tbuf[2], (err - 2) / 2);
> +     buf[err] = 0;
>  
>       if (tbuf[1] != USB_DT_STRING)
> -     dev_dbg(&dev->dev, "wrong descriptor type %02x for string %d (\"%s\")\n", tbuf[1], index, buf);
> +             dev_dbg(&dev->dev, "wrong descriptor type %02x for string "
> +                             "%d (\"%s\")\n", tbuf[1], index, buf);
>  
>   errout:
>       kfree(tbuf);
>
>
