> Date: Tue, 5 Jun 2007 17:00:56 -0400 (EDT)
> From: Alan Stern <[EMAIL PROTECTED]>
>
> > > > Does anybody think it would be worthwhile to convert string descriptors
> > > > from UCS-16 to UTF-8 (instead of Latin1) when we read them in?
> >
> > Or even UTF-7 ... ?  FWIW the input isn't UCS-16; it's UTF16-LE.
>
> Do you happen to know where "UCS-16" is defined?
Not right off the bat, but see the Unicode 5.0 spec and

  http://www.ietf.org/rfc/rfc3629.txt

To a first approximation, "UCS-16 ~= Unicode without surrogates".
Unicode is at some level a subset of UCS-32 ... UCS-16 plus the
characters that can be accessed using surrogate pairs.

A UTF is a "UCS Transformation Format" ... an encoding (bit pattern,
endianness matters) of what a "Universal Character Set" (UCS)
represents (logical "characters" associated with glyphs, associated
with numbers).

So for example UTF-8 was originally defined as mapping UCS-32 into
a sequence of 8-bit bytes, although nowadays it's more often treated
as support for Unicode, not UTF-32.  (See the predecessor of the
above RFC to see the five and six byte UTF-8 encodings defined.)

> > > I remember that issue. I thought that we wanted some kind of escape
> > > syntax... Like what HTML uses with &#xxxx; perhaps. This would allow
> > > to edit xorg.conf on systems which are not UTF-8 clean. But perhaps
> > > it's a non-goal. How big is the code to convert (we need both ways,
> > > right)?
> >
> > How big?  Not big.  UTF-16 to UTF-8 is a simple algorithm.  For
> > the reverse, see drivers/usb/gadget/usbstring.c ... the trick is
> > you'd need to know enough Unicode to not goof it.  Or, to find
> > some code that does it right.
>
> Here's a patch.  Anybody see anything wrong with it?  I don't have any
> devices with non-ASCII characters in the default language descriptors
> for testing.  It would be nice if there was a library in the kernel to
> do these sorts of conversions, but there doesn't appear to be.

You could _start_ such a library, in lib/utf8.c or somesuch ...

> Nicolas, does it make your life any easier?
>
> Alan Stern
>
>
> Index: usb-2.6/drivers/usb/core/message.c
> ===================================================================
> --- usb-2.6.orig/drivers/usb/core/message.c
> +++ usb-2.6/drivers/usb/core/message.c
> @@ -731,24 +731,71 @@ static int usb_string_sub(struct usb_dev
>  }
>
>  /**
> - * usb_string - returns ISO 8859-1 version of a string descriptor
> + * utf16le_to_utf8 - convert a string encoded in UTF-16LE to UTF-8
> + * @dst: the UTF-8 output buffer
> + * @dst_len: number of bytes available in @dest
> + * @src: the UTF-16LE input buffer
> + * @src_len: number of two-byte characters in @src
> + *
> + * Stores as many completely converted characters from @src as will fit
> + * in @dst (i.e., no partial character will remain at the end of @dst).
> + * No terminating NULL is appended to @dst.
> + *
> + * Returns the number of bytes stored in @dst.

Or negative error code ... you'll need error exits, see below.

> + */
> +static int utf16le_to_utf8(u8 *dst, size_t dst_len, u8 *src, size_t src_len)

I'd have expected "__le16 *src", maybe with get_unaligned()... otherwise
it'll be impossible to catch errors like wrongly passing in some
"__be16 *" data.

> +{
> +	unsigned c;
> +	u8 *d, *e1, *e2, *e3;
> +
> +	e1 = dst + dst_len - 1;
> +	e2 = e1 - 1;
> +	e3 = e2 - 1;
> +	for (d = dst; src_len > 0; (--src_len, src += 2)) {
> +		c = src[0] | (src[1] << 8);
> +		if (c < 0x80) {
> +			/* 0******* */

It'd be clearer if you included the Unicode values in the comments,
not just the output bytes.  So:  one byte UTF-8 code points are less
than U+0080 ...

> +			if (d > e1)
> +				break;
> +			d[0] = c;
> +			d += 1;
> +		} else if (c < 0x800) {
> +			/* 110***** 10****** */

... two byte UTF-8 code points are less than U+0800 ...

> +			if (d > e2)
> +				break;
> +			d[0] = 0xc0 | (c >> 6);
> +			d[1] = 0x80 | (c & 0x3f);
> +			d += 2;
> +		} else {
> +			/* 1110**** 10****** 10****** */

... three-byte UTF-8 code points are everything else ...

... EXCEPT (!!)
for surrogate characters which require the four-byte encodings.
If you don't handle these correctly, you need to fail instead
of generating the wrong encoding.

Conceptually, what you need to do is map the UTF-16 surrogate pairs
into their full UCS code points, and then map those into UTF-8.

UTF-16 inputs can have errors whereby the surrogate codes are not
correctly paired ... you should have an error exit to report bogus
UTF-16 format data.

I've seen descriptions of security bugs that hide inside such
transcoding bugs (mis-handling surrogates, etc).  So if this
weren't inside the kernel, that might not be a big deal ...
See the security issues in the RFC above for some hints about
some of the relevant attacks.

- Dave

> +			if (d > e3)
> +				break;
> +			d[0] = 0xe0 | (c >> 12);
> +			d[1] = 0x80 | ((c >> 6) & 0x3f);
> +			d[2] = 0x80 | (c & 0x3f);
> +			d += 3;
> +		}
> +	}
> +	return d - dst;
> +}
> +
> +/**
> + * usb_string - returns UTF-8 version of a string descriptor
>   * @dev: the device whose string descriptor is being retrieved
>   * @index: the number of the descriptor
>   * @buf: where to put the string
>   * @size: how big is "buf"?
>   * Context: !in_interrupt ()
>   *
> - * This converts the UTF-16LE encoded strings returned by devices, from
> - * usb_get_string_descriptor(), to null-terminated ISO-8859-1 encoded ones
> - * that are more usable in most kernel contexts.  Note that all characters
> - * in the chosen descriptor that can't be encoded using ISO-8859-1
> - * are converted to the question mark ("?") character, and this function
> - * chooses strings in the first language supported by the device.
> + * This retrieves a UTF-16LE encoded string from a device and converts
> + * it to a NULL-terminated UTF-8 encoded string as used by the rest of
> + * the kernel.  Note that this function chooses strings in the first
> + * language supported by the device.
>   *
>   * The ASCII (or, redundantly, "US-ASCII") character set is the seven-bit
> - * subset of ISO 8859-1.  ISO-8859-1 is the eight-bit subset of Unicode,
> - * and is appropriate for use many uses of English and several other
> - * Western European languages.  (But it doesn't include the "Euro" symbol.)
> + * subset of UTF-8.  Strings containing only ASCII characters appear exactly
> + * the same when encoded in UTF-8.  Characters (or "code-points") with
> + * values above 127 are encoded using multiple bytes.
>   *
>   * This call is synchronous, and may not be used in an interrupt context.
>   *
> @@ -758,7 +805,6 @@ int usb_string(struct usb_device *dev, i
>  {
>  	unsigned char *tbuf;
>  	int err;
> -	unsigned int u, idx;
>
>  	if (dev->state == USB_STATE_SUSPENDED)
>  		return -EHOSTUNREACH;
> @@ -794,20 +840,12 @@ int usb_string(struct usb_device *dev, i
>  	if (err < 0)
>  		goto errout;
>
> -	size--;		/* leave room for trailing NULL char in output buffer */
> -	for (idx = 0, u = 2; u < err; u += 2) {
> -		if (idx >= size)
> -			break;
> -		if (tbuf[u+1])			/* high byte */
> -			buf[idx++] = '?';  /* non ISO-8859-1 character */
> -		else
> -			buf[idx++] = tbuf[u];
> -	}
> -	buf[idx] = 0;
> -	err = idx;
> +	err = utf16le_to_utf8(buf, size - 1, &tbuf[2], (err - 2) / 2);
> +	buf[err] = 0;
>
>  	if (tbuf[1] != USB_DT_STRING)
> -		dev_dbg(&dev->dev, "wrong descriptor type %02x for string %d
> (\"%s\")\n", tbuf[1], index, buf);
> +		dev_dbg(&dev->dev, "wrong descriptor type %02x for string "
> +				"%d (\"%s\")\n", tbuf[1], index, buf);
>
>  errout:
>  	kfree(tbuf);
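
For illustration, the surrogate handling asked for in the review comments
(map each surrogate pair to its full code point, then encode that as UTF-8,
and fail on unpaired surrogates) could be sketched roughly as below.  This
is a standalone userspace C sketch, not the patch under review and not
existing kernel code: the names utf16le_to_utf8_sketch and read_le16, the
parameter layout, and the plain -1 error return are all assumptions made
for the example.  A kernel version would presumably take a "__le16 *" and
return a negative errno instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read one little-endian 16-bit code unit. */
static uint32_t read_le16(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8);
}

/*
 * Hypothetical sketch: convert src_units UTF-16LE code units to UTF-8.
 * Returns the number of bytes stored in dst, or -1 if the input has an
 * unpaired surrogate.  Stops early, without error, when the next
 * character would not fit completely in dst.  No terminator is added.
 */
long utf16le_to_utf8_sketch(uint8_t *dst, size_t dst_len,
			    const uint8_t *src, size_t src_units)
{
	size_t out = 0, i = 0;

	assert(dst != NULL && src != NULL);

	while (i < src_units) {
		uint32_t c = read_le16(src + 2 * i);
		size_t used = 1, need;

		if (c >= 0xD800 && c <= 0xDBFF) {
			/* High surrogate: a low surrogate must follow. */
			uint32_t lo;

			if (i + 1 >= src_units)
				return -1;
			lo = read_le16(src + 2 * (i + 1));
			if (lo < 0xDC00 || lo > 0xDFFF)
				return -1;
			/* Combine the pair into U+10000..U+10FFFF. */
			c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
			used = 2;
		} else if (c >= 0xDC00 && c <= 0xDFFF) {
			/* Low surrogate with no preceding high surrogate. */
			return -1;
		}

		/* < U+0080: 1 byte, < U+0800: 2, < U+10000: 3, else 4. */
		need = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
		if (dst_len - out < need)
			break;

		if (need == 1) {
			dst[out] = (uint8_t)c;
		} else if (need == 2) {
			dst[out] = (uint8_t)(0xC0 | (c >> 6));
			dst[out + 1] = (uint8_t)(0x80 | (c & 0x3F));
		} else if (need == 3) {
			dst[out] = (uint8_t)(0xE0 | (c >> 12));
			dst[out + 1] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
			dst[out + 2] = (uint8_t)(0x80 | (c & 0x3F));
		} else {
			dst[out] = (uint8_t)(0xF0 | (c >> 18));
			dst[out + 1] = (uint8_t)(0x80 | ((c >> 12) & 0x3F));
			dst[out + 2] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
			dst[out + 3] = (uint8_t)(0x80 | (c & 0x3F));
		}
		out += need;
		i += used;
	}
	return (long)out;
}
```

The point of returning an error rather than emitting a three-byte sequence
for a lone surrogate is the one made in the review: wrong output is worse
than no output.  Whether a truncated buffer should also be reported as an
error is a separate design choice; the sketch keeps the patch's behaviour
of silently stopping at the last complete character.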