On Mon, 4 Jun 2007, David Brownell wrote:
> On Monday 04 June 2007, Pete Zaitcev wrote:
> > On Mon, 4 Jun 2007 16:52:01 -0400 (EDT), Alan Stern wrote:
>
> > > Does anybody think it would be worthwhile to convert string descriptors
> > > from UCS-16 to UTF-8 (instead of Latin1) when we read them in?
>
> Or even UTF-7 ... ? FWIW the input isn't UCS-16; it's UTF16-LE.
Do you happen to know where "UCS-16" is defined?
> > I remember that issue. I thought that we wanted some kind of escape
> > syntax... Like what HTML uses with &#xxxx; perhaps. This would allow
> > to edit xorg.conf on systems which are not UTF-8 clean. But perhaps
> > it's a non-goal. How big is the code to convert (we need both ways,
> > right)?
>
> How big? Not big. UTF-16 to UTF-8 is a simple algorithm. For
> the reverse, see drivers/usb/gadget/usbstring.c ... the trick is
> you'd need to know enough Unicode to not goof it. Or, to find
> some code that does it right.
Here's a patch. Anybody see anything wrong with it? I don't have any
devices with non-ASCII characters in the default language descriptors
for testing. It would be nice if there was a library in the kernel to
do these sorts of conversions, but there doesn't appear to be.

Nicolas, does it make your life any easier?

Alan Stern
Index: usb-2.6/drivers/usb/core/message.c
===================================================================
--- usb-2.6.orig/drivers/usb/core/message.c
+++ usb-2.6/drivers/usb/core/message.c
@@ -731,24 +731,71 @@ static int usb_string_sub(struct usb_dev
}
/**
- * usb_string - returns ISO 8859-1 version of a string descriptor
+ * utf16le_to_utf8 - convert a string encoded in UTF-16LE to UTF-8
+ * @dst: the UTF-8 output buffer
+ * @dst_len: number of bytes available in @dst
+ * @src: the UTF-16LE input buffer
+ * @src_len: number of 16-bit code units in @src
+ *
+ * Stores as many completely converted characters from @src as will fit
+ * in @dst (i.e., no partial character will remain at the end of @dst).
+ * No terminating NUL byte is appended to @dst.
+ *
+ * Returns the number of bytes stored in @dst.
+ */
+static int utf16le_to_utf8(u8 *dst, size_t dst_len, u8 *src, size_t src_len)
+{
+ unsigned c;
+ u8 *d, *e1, *e2, *e3;
+
+ e1 = dst + dst_len - 1;
+ e2 = e1 - 1;
+ e3 = e2 - 1;
+ for (d = dst; src_len > 0; (--src_len, src += 2)) {
+ c = src[0] | (src[1] << 8);
+ if (c < 0x80) {
+ /* 0******* */
+ if (d > e1)
+ break;
+ d[0] = c;
+ d += 1;
+ } else if (c < 0x800) {
+ /* 110***** 10****** */
+ if (d > e2)
+ break;
+ d[0] = 0xc0 | (c >> 6);
+ d[1] = 0x80 | (c & 0x3f);
+ d += 2;
+ } else {
+ /* 1110**** 10****** 10****** */
+ if (d > e3)
+ break;
+ d[0] = 0xe0 | (c >> 12);
+ d[1] = 0x80 | ((c >> 6) & 0x3f);
+ d[2] = 0x80 | (c & 0x3f);
+ d += 3;
+ }
+ }
+ return d - dst;
+}
+
+/**
+ * usb_string - returns UTF-8 version of a string descriptor
* @dev: the device whose string descriptor is being retrieved
* @index: the number of the descriptor
* @buf: where to put the string
* @size: how big is "buf"?
* Context: !in_interrupt ()
*
- * This converts the UTF-16LE encoded strings returned by devices, from
- * usb_get_string_descriptor(), to null-terminated ISO-8859-1 encoded ones
- * that are more usable in most kernel contexts. Note that all characters
- * in the chosen descriptor that can't be encoded using ISO-8859-1
- * are converted to the question mark ("?") character, and this function
- * chooses strings in the first language supported by the device.
+ * This retrieves a UTF-16LE encoded string from a device and converts
+ * it to a NUL-terminated UTF-8 encoded string as used by the rest of
+ * the kernel. Note that this function chooses strings in the first
+ * language supported by the device.
*
* The ASCII (or, redundantly, "US-ASCII") character set is the seven-bit
- * subset of ISO 8859-1. ISO-8859-1 is the eight-bit subset of Unicode,
- * and is appropriate for use many uses of English and several other
- * Western European languages. (But it doesn't include the "Euro" symbol.)
+ * subset of UTF-8. Strings containing only ASCII characters appear exactly
+ * the same when encoded in UTF-8. Characters (or "code-points") with
+ * values above 127 are encoded using multiple bytes.
*
* This call is synchronous, and may not be used in an interrupt context.
*
@@ -758,7 +805,6 @@ int usb_string(struct usb_device *dev, i
{
unsigned char *tbuf;
int err;
- unsigned int u, idx;
if (dev->state == USB_STATE_SUSPENDED)
return -EHOSTUNREACH;
@@ -794,20 +840,12 @@ int usb_string(struct usb_device *dev, i
if (err < 0)
goto errout;
- size--; /* leave room for trailing NULL char in output buffer */
- for (idx = 0, u = 2; u < err; u += 2) {
- if (idx >= size)
- break;
- if (tbuf[u+1]) /* high byte */
- buf[idx++] = '?'; /* non ISO-8859-1 character */
- else
- buf[idx++] = tbuf[u];
- }
- buf[idx] = 0;
- err = idx;
+ err = utf16le_to_utf8(buf, size - 1, &tbuf[2], (err - 2) / 2);
+ buf[err] = 0;
if (tbuf[1] != USB_DT_STRING)
- dev_dbg(&dev->dev, "wrong descriptor type %02x for string %d (\"%s\")\n", tbuf[1], index, buf);
+ dev_dbg(&dev->dev, "wrong descriptor type %02x for string "
+ "%d (\"%s\")\n", tbuf[1], index, buf);
errout:
kfree(tbuf);
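
One thing worth checking in the conversion above: UTF-16 code units in
the range 0xD800-0xDFFF are halves of surrogate pairs, and the final
branch encodes each half as its own three-byte sequence, which is not
valid UTF-8. Below is a rough userspace sketch of how pairs could be
combined into a single four-byte sequence. It is only an illustration,
not part of the patch: the name utf16le_to_utf8_sketch() is made up, and
replacing unpaired surrogates with U+FFFD is an assumed policy that
might not be what we want for descriptors.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch only (hypothetical helper, not the function in the patch).
 * Converts UTF-16LE to UTF-8, combining surrogate pairs.
 * Assumption: unpaired surrogates become U+FFFD.
 * src_len counts 16-bit code units; returns bytes stored in dst.
 */
static size_t utf16le_to_utf8_sketch(uint8_t *dst, size_t dst_len,
				     const uint8_t *src, size_t src_len)
{
	size_t out = 0, i = 0;

	while (i < src_len) {
		uint32_t c = src[2 * i] | ((uint32_t)src[2 * i + 1] << 8);
		size_t used = 1, need;

		if (c >= 0xd800 && c <= 0xdbff) {
			/* high surrogate: look for a low surrogate after it */
			uint32_t lo = 0;

			if (i + 1 < src_len)
				lo = src[2 * i + 2] |
				     ((uint32_t)src[2 * i + 3] << 8);
			if (lo >= 0xdc00 && lo <= 0xdfff) {
				c = 0x10000 + ((c - 0xd800) << 10) +
				    (lo - 0xdc00);
				used = 2;
			} else {
				c = 0xfffd;	/* unpaired high surrogate */
			}
		} else if (c >= 0xdc00 && c <= 0xdfff) {
			c = 0xfffd;		/* stray low surrogate */
		}

		need = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
		if (need > dst_len - out)
			break;	/* never leave a partial character */

		switch (need) {
		case 1:	/* 0******* */
			dst[out++] = (uint8_t)c;
			break;
		case 2:	/* 110***** 10****** */
			dst[out++] = (uint8_t)(0xc0 | (c >> 6));
			dst[out++] = (uint8_t)(0x80 | (c & 0x3f));
			break;
		case 3:	/* 1110**** 10****** 10****** */
			dst[out++] = (uint8_t)(0xe0 | (c >> 12));
			dst[out++] = (uint8_t)(0x80 | ((c >> 6) & 0x3f));
			dst[out++] = (uint8_t)(0x80 | (c & 0x3f));
			break;
		default:	/* 11110*** 10****** 10****** 10****** */
			dst[out++] = (uint8_t)(0xf0 | (c >> 18));
			dst[out++] = (uint8_t)(0x80 | ((c >> 12) & 0x3f));
			dst[out++] = (uint8_t)(0x80 | ((c >> 6) & 0x3f));
			dst[out++] = (uint8_t)(0x80 | (c & 0x3f));
			break;
		}
		i += used;
	}
	return out;
}
```

The sketch tracks the output position with an index rather than end
pointers, so a dst_len smaller than three never requires forming a
pointer before the start of the buffer.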