On Tue, 5 Jun 2007, David Brownell wrote: > > Date: Tue, 5 Jun 2007 17:00:56 -0400 (EDT) > > From: Alan Stern <[EMAIL PROTECTED]> > > > > > > > Does anybody think it would be worthwhile to convert string > > > > > descriptors > > > > > from UCS-16 to UTF-8 (instead of Latin1) when we read them in? > > > > > > Or even UTF-7 ... ? FWIW the input isn't UCS-16; it's UTF16-LE. > > > > Do you happen to know where "UCS-16" is defined? > > Not right off the bat, but see the Unicode 5.0 spec and > > http://www.ietf.org/rfc/rfc3629.txt > > To a first approximation, "UCS-16 ~= Unicode without surrogates". > Unicode is at some level a subset of UCS-32 ... UCS-16 plus the > characters that can be accessed using surrogate pairs. > > A UTF is a "UCS Transformation Format" ... an encoding (bit pattern, > endianness matters) of what a "Universal Character Set" (UCS) represents > (logical "characters" associated with glyphs, associated with numbers). > > So for example UTF-8 was originally defined as mapping UCS-32 into a > sequence of 8-bit bytes, although nowadays its more often treated as > support for Unicode, not UTF-32. (See the predecessor of the above RFC > to the five and six byte UTF-8 encodings defined.)
Thanks for the explanation. I see from the RFC that the correct names are UCS-2 and UCS-4, not UCS-16 and UCS-32. It's a shame that so few parts of the Unicode standard are freely available. > > Here's a patch. Anybody see anything wrong with it? I don't have any > > devices with non-ASCII characters in the default language descriptors > > for testing. It would be nice if there was a library in the kernel to > > do these sorts of conversions, but there doesn't appear to be. > > You could _start_ such a library, in lib/utf8.c or somesuch ... I'll look into it. The actual conversion code was taken from elsewhere in the kernel, so starting a library isn't a bad idea. > > + */ > > +static int utf16le_to_utf8(u8 *dst, size_t dst_len, u8 *src, size_t > > src_len) > > I'd have expected "__le16 *src", maybe with read_unaligned()... > otherwise it'll be impossible to catch errors like wrongly > passing in some "__be16 *" data. Or some other type-safe approach. For a library it's a little awkward to have duplicate little-endian and big-endian versions of everything. I did it this way here because that's how usb_string() passes its arguments. Changing to "__le16 *src" would require a cast in the caller, which kind of defeats the purpose. But at least it is explicit. > > +{ > > + unsigned c; > > + u8 *d, *e1, *e2, *e3; > > + > > + e1 = dst + dst_len - 1; > > + e2 = e1 - 1; > > + e3 = e2 - 1; > > + for (d = dst; src_len > 0; (--src_len, src += 2)) { > > + c = src[0] | (src[1] << 8); > > + if (c < 0x80) { > > + /* 0******* */ > > It'd be clearer if you included the Unicode values in the > comments, not just the output bytes. So: one byte UTF-8 > code points are less than U+0080 ... Okay. > > + if (d > e1) > > + break; > > + d[0] = c; > > + d += 1; > > + } else if (c < 0x800) { > > + /* 110***** 10****** */ > > ... two byte UTF-8 code points are less than U+0800 ... > > > + if (d > e2) > > + break; > > + d[0] = 0xc0 | (c >> 6); > > + d[1] = 0x80 | (c & 0x3f); > > + d += 2; > > + } else { > > + /* 1110**** 10****** 10****** */ > > ... three-byte UTF-8 code points are everything else ... > > ... EXCEPT (!!) for surrogate characters which require the > four-byte encodings. If you don't handle these correctly, > you need to fail instead of generating the wrong encoding. > > Conceptually, what you need to do is map the UTF-16 surrogate > pairs into their full UCS code points, and then map those into > UTF-8. UTF-16 inputs can have errors whereby the surrogate > codes are not correcly paired ... you should have an error > exit to report bogus UTF-16 format data. I'll add it. It's worth noting that the other existing routines from which I borrowed this algorithm don't handle surrogate pairs. > I've seen descriptions of security bugs that hide inside such > transcoding bugs (mis-handling surrogates, etc). So if this > weren't inside the kernel, that might not be a big deal ... > > See the security issues in the RFC above for some hints about > some of the relevant attacks. > Another security risk: emitting a NUL before the end of the > string. Treat UTF-16 inputs like { 'a', 'b', 0, 'c', 'd' } > as errors, don't just return "ab\0cd". It's hard to know what to make of this one. In some sense it isn't a fault of the conversion routine, because it is documented as returning a byte array and a length -- not a NUL-terminated string. However the caller is liable to use it as a string (like usb_string() does), so your suggestion seems to be the safest course. Another potential error is truncation because the destination buffer is too short. Since snprintf() doesn't treat truncation as an error, I guess this routine shouldn't either. Alan Stern ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ linux-usb-devel@lists.sourceforge.net To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel