On Wednesday 06 June 2007, Alan Stern wrote:
> On Tue, 5 Jun 2007, David Brownell wrote:
> 
> > > Date: Tue, 5 Jun 2007 17:00:56 -0400 (EDT)
> > > From: Alan Stern <[EMAIL PROTECTED]>
> > >
> > > > > > > Does anybody think it would be worthwhile to convert string
> > > > > > > descriptors from UCS-16 to UTF-8 (instead of Latin1) when we
> > > > > > > read them in?
> > > > 
> > > > Or even UTF-7 ... ?   FWIW the input isn't UCS-16; it's UTF16-LE.
> > >
> > > Do you happen to know where "UCS-16" is defined?
> > 
> > Not right off the bat, but see the Unicode 5.0 spec and
> > 
> >    http://www.ietf.org/rfc/rfc3629.txt
> > 
> > To a first approximation, "UCS-16 ~= Unicode without surrogates".
> > Unicode is at some level a subset of UCS-32 ... UCS-16 plus the
> > characters that can be accessed using surrogate pairs.
> > ...
> 
> Thanks for the explanation.  I see from the RFC that the correct names 
> are UCS-2 and UCS-4, not UCS-16 and UCS-32.

The popularity of various terms and concepts has changed over time.
Like the notion of UTF-8 being based on the full UCS, not Unicode.


> It's a shame that so few parts of the Unicode standard are freely
> available.

If you do I18N work, you need a copy of that spec and many others.
Otherwise, you can just borrow one for a while.  Older copies are
still mostly useful too.

 

> > > +static int utf16le_to_utf8(u8 *dst, size_t dst_len, u8 *src, size_t src_len)
> > 
> > I'd have expected "__le16 *src", maybe with read_unaligned()...
> > otherwise it'll be impossible to catch errors like wrongly
> > passing in some "__be16 *" data.
> 
> Or some other type-safe approach.  For a library it's a little awkward 
> to have duplicate little-endian and big-endian versions of everything.

Mechanically, that is ... not that UTF16-BE is all that common;
it's not clear we'd need to support it.  Maybe some filesystem
would use that someday.


> I did it this way here because that's how usb_string() passes its 
> arguments.  Changing to "__le16 *src" would require a cast in the 
> caller, which kind of defeats the purpose.  But at least it is 
> explicit.

And it would actually reflect what USB specifies that way, too.
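
To illustrate the type-safety point outside the kernel, here is a minimal
userspace sketch.  The names le16_t, be16_t, le16_load() and be16_load()
are hypothetical stand-ins invented for this example; they are not the
kernel's __le16/__be16 annotations or its unaligned-access helpers.  Giving
each byte order its own struct type lets the compiler diagnose a caller
that passes big-endian data to a little-endian reader, and the byte-wise
loads have no alignment requirement.

```c
#include <stdint.h>

/* Hypothetical stand-ins for endian-annotated 16-bit types. */
typedef struct { uint8_t b[2]; } le16_t;	/* little-endian on the wire */
typedef struct { uint8_t b[2]; } be16_t;	/* big-endian on the wire */

/* Load a little-endian 16-bit value; works at any alignment. */
static inline uint16_t le16_load(const le16_t *p)
{
	return (uint16_t)(p->b[0] | (p->b[1] << 8));
}

/* Load a big-endian 16-bit value; works at any alignment. */
static inline uint16_t be16_load(const be16_t *p)
{
	return (uint16_t)((p->b[0] << 8) | p->b[1]);
}
```

Passing a "be16_t *" to le16_load() is an incompatible-pointer-type
diagnostic, which is the kind of mistake a plain "u8 *src" cannot catch.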


> > > +         } else {
> > > +                 /* 1110**** 10****** 10****** */
> > 
> > ... three-byte UTF-8 code points are everything else ...
> > 
> > ... EXCEPT (!!) for surrogate characters which require the
> > four-byte encodings.  If you don't handle these correctly,
> > you need to fail instead of generating the wrong encoding.
> > 
> > Conceptually, what you need to do is map the UTF-16 surrogate
> > pairs into their full UCS code points, and then map those into
> > UTF-8.  UTF-16 inputs can have errors whereby the surrogate
> > codes are not correctly paired ... you should have an error
> > exit to report bogus UTF-16 format data.
> 
> I'll add it.  It's worth noting that the other existing routines from 
> which I borrowed this algorithm don't handle surrogate pairs.

It's not an uncommon bug.  But it's a bug, and is the sort
of thing that security audits of I18N code will notice.
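
For concreteness, the surrogate handling described above could look
roughly like the following.  This is a sketch, not the code from the patch
under review: the function names utf16le_decode() and utf8_encode() are
made up for the example, and it uses plain userspace types rather than
kernel ones.  The decoder combines a high/low surrogate pair into one code
point and reports an error for a lone, misordered, or truncated surrogate;
the encoder then emits the 1 to 4 byte UTF-8 form.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one code point from UTF-16LE bytes.
 * Returns the number of bytes consumed (2 or 4), or -1 if the input is
 * truncated or the surrogates are not correctly paired.
 */
static int utf16le_decode(const uint8_t *src, size_t src_len, uint32_t *cp)
{
	uint32_t hi, lo;

	if (src_len < 2)
		return -1;
	hi = src[0] | ((uint32_t)src[1] << 8);

	if (hi < 0xD800 || hi > 0xDFFF) {	/* ordinary BMP character */
		*cp = hi;
		return 2;
	}
	if (hi >= 0xDC00)			/* low surrogate with no lead */
		return -1;
	if (src_len < 4)			/* high surrogate at end of input */
		return -1;
	lo = src[2] | ((uint32_t)src[3] << 8);
	if (lo < 0xDC00 || lo > 0xDFFF)		/* high not followed by low */
		return -1;

	*cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
	return 4;
}

/*
 * Encode one code point (at most U+10FFFF) as UTF-8.
 * The caller must provide room for 4 bytes.  Returns bytes written.
 */
static int utf8_encode(uint32_t cp, uint8_t *dst)
{
	if (cp < 0x80) {
		dst[0] = (uint8_t)cp;
		return 1;
	}
	if (cp < 0x800) {
		dst[0] = (uint8_t)(0xC0 | (cp >> 6));
		dst[1] = (uint8_t)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		dst[0] = (uint8_t)(0xE0 | (cp >> 12));
		dst[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
		dst[2] = (uint8_t)(0x80 | (cp & 0x3F));
		return 3;
	}
	dst[0] = (uint8_t)(0xF0 | (cp >> 18));
	dst[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
	dst[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
	dst[3] = (uint8_t)(0x80 | (cp & 0x3F));
	return 4;
}
```

The three-byte branch is only reached for non-surrogate values because the
decoder has already rejected or combined everything in 0xD800..0xDFFF.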


> > Another security risk:  emitting a NUL before the end of the
> > string.  Treat UTF-16 inputs like { 'a', 'b', 0, 'c', 'd' }
> > as errors, don't just return "ab\0cd".
> 
> It's hard to know what to make of this one.  In some sense it isn't a
> fault of the conversion routine, because it is documented as returning
> a byte array and a length -- not a NUL-terminated string.  However the
> caller is liable to use it as a string (like usb_string() does), so
> your suggestion seems to be the safest course.

Right ... counted strings are not the native model, and even the
case of "as much as fits in this buffer" is error-prone.
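
One way to reject embedded NULs up front is a check along these lines.
It is only a sketch with a made-up name, utf16le_has_nul(); the point it
shows is that the test has to be done per 16-bit code unit, since a zero
byte on its own is perfectly normal in UTF-16LE (every ASCII character
has one).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if any 16-bit code unit in the UTF-16LE buffer is U+0000,
 * else 0.  A trailing odd byte, if any, is ignored here.
 */
static int utf16le_has_nul(const uint8_t *src, size_t src_len)
{
	size_t i;

	for (i = 0; i + 1 < src_len; i += 2) {
		if (src[i] == 0 && src[i + 1] == 0)
			return 1;
	}
	return 0;
}
```

A caller that intends to use the result as a C string would treat a
nonzero return as an error rather than convert the data.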


> Another potential error is truncation because the destination buffer is 
> too short.  Since snprintf() doesn't treat truncation as an error, I 
> guess this routine shouldn't either.

But this isn't snprintf().  Related issue of course is how you'd
handle truncation in the middle of a multibyte UTF-8 character code.

A string with lots of East-Asian characters ("CJKV", meaning
Chinese/Japanese/Korean/Vietnamese) taking N bytes of UTF-16LE
will often expand by up to 50%, so running out of space isn't
an idle concern.
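
The 50% figure comes from the encoding lengths: a BMP character above
U+07FF is 2 bytes in UTF-16 and 3 bytes in UTF-8, while a surrogate pair
is 4 bytes in both.  So a caller that wants to rule out truncation could
size its buffer with something like this (utf8_worst_case_len() is a
hypothetical helper, not an existing interface):

```c
#include <stddef.h>

/*
 * Upper bound on the UTF-8 length, in bytes, of a UTF-16 string that
 * occupies utf16_bytes bytes: at most 3 output bytes per 16-bit unit.
 * Does not include room for a terminating NUL.
 */
static size_t utf8_worst_case_len(size_t utf16_bytes)
{
	return (utf16_bytes / 2) * 3;
}
```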

It might be safest to switch between two behaviors based on a
parameter ... so you could tell whether a return of "dst_len"
represented buffer-full, or not.  In fact, I figure there are
several such behaviors to care about, and probably there ought
to be a core transcoding routine with a few convenience wrappers
around it.
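
As a rough sketch of that shape, assuming names and error codes invented
here for illustration (XCODE_STRICT, XCODE_EINVAL, XCODE_ENOSPC and the
function names are all hypothetical): one core routine takes a flag that
selects between "fail if it doesn't fit" and "truncate", and truncation
only ever happens on a character boundary, never in the middle of a
multibyte UTF-8 sequence.  Malformed surrogates and embedded NULs are
errors in both modes.

```c
#include <stddef.h>
#include <stdint.h>

#define XCODE_EINVAL	(-1)	/* malformed UTF-16 or embedded NUL */
#define XCODE_ENOSPC	(-2)	/* output did not fit (strict mode only) */
#define XCODE_STRICT	0x1	/* fail instead of truncating */

/* Returns the number of UTF-8 bytes written, or a negative XCODE_* code. */
static long utf16le_to_utf8_core(uint8_t *dst, size_t dst_len,
				 const uint8_t *src, size_t src_len,
				 unsigned int flags)
{
	size_t in = 0, out = 0;

	if (src_len & 1)
		return XCODE_EINVAL;

	while (in < src_len) {
		uint32_t cp = src[in] | ((uint32_t)src[in + 1] << 8);
		size_t used = 2, need;

		if (cp >= 0xD800 && cp <= 0xDBFF) {	/* high surrogate */
			uint32_t lo;

			if (src_len - in < 4)
				return XCODE_EINVAL;
			lo = src[in + 2] | ((uint32_t)src[in + 3] << 8);
			if (lo < 0xDC00 || lo > 0xDFFF)
				return XCODE_EINVAL;
			cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
			used = 4;
		} else if (cp >= 0xDC00 && cp <= 0xDFFF) {
			return XCODE_EINVAL;		/* unpaired low surrogate */
		} else if (cp == 0) {
			return XCODE_EINVAL;		/* embedded NUL */
		}

		need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
		if (need > dst_len - out) {
			if (flags & XCODE_STRICT)
				return XCODE_ENOSPC;
			break;		/* truncate on a character boundary */
		}

		if (need == 1) {
			dst[out++] = (uint8_t)cp;
		} else {
			/* lead byte: 110xxxxx, 1110xxxx or 11110xxx */
			size_t shift = 6 * (need - 1);
			static const uint8_t lead[] = { 0, 0, 0xC0, 0xE0, 0xF0 };

			dst[out++] = (uint8_t)(lead[need] | (cp >> shift));
			while (shift) {		/* continuation bytes: 10xxxxxx */
				shift -= 6;
				dst[out++] = (uint8_t)(0x80 | ((cp >> shift) & 0x3F));
			}
		}
		in += used;
	}
	return (long)out;
}

/* Convenience wrapper: truncate quietly, like "as much as fits". */
static long utf16le_to_utf8_trunc(uint8_t *dst, size_t dst_len,
				  const uint8_t *src, size_t src_len)
{
	return utf16le_to_utf8_core(dst, dst_len, src, src_len, 0);
}

/* Convenience wrapper: all or nothing. */
static long utf16le_to_utf8_strict(uint8_t *dst, size_t dst_len,
				   const uint8_t *src, size_t src_len)
{
	return utf16le_to_utf8_core(dst, dst_len, src, src_len, XCODE_STRICT);
}
```

With the strict wrapper, a return equal to dst_len means the string fit
exactly; with the truncating wrapper the caller still can't tell a full
buffer from an exact fit, which is why the choice is left to a flag.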

- Dave


