On Wednesday 06 June 2007, Alan Stern wrote:
> It turns out that fs/nls/nls_base.c already includes conversion
> routines such as utf8_wcstombs().  Unfortunately they are not an ideal
> match to what we want for several reasons:
>
> ...

One of the ongoing headaches of I18N work is backwards compatibility
with old bugs.  Even ones with security ramifications... and basic
conceptual goofs, like assuming that a "character" will fit into a
single "char" (or more recently, a 16-bit "wchar_t").

> A few other source files, such as drivers/s390/char/keyboard.c, include
> their own home-brewed UTF-8 conversions.  In some cases the converted
> characters aren't stored in a buffer; they are passed to various inline
> routines.  In no cases are surrogate pairs handled correctly.

Correct handling of surrogates has always been a problem.  Just like
having correct utilities to build on has been...

It's stuff like this which makes companies create job titles like
"Internationalization Guru", and give those people enough clout to go
fix all the little fires burning all over the place.

I noticed a few years ago that nobody in Linux had declared themselves
the kernel I18N guru and taken on that job.  It's sort of understandable
-- it's a *huge* area, more thankless than most, and folk with the
relevant knowledge tend not to have any street cred among kernel folk --
but I'm surprised that even the basics like UTF-{8,16LE} are as broken
as you report.  That's the _easy_ stuff!

> In fact, as far as I can see no part of the kernel is prepared to
> handle Unicode values higher than U+ffff.

The idea is that they shouldn't need to know that's what they're doing
... because proper string handling ensures any characters in the "astral
planes" are transparently passed through.  (Such policies are what make
Unicode more than just a character set.)

That relies on correct transcoding though.  And some of that code you
pointed to clearly dates from very early implementations, before people
became very aware of the security (including integrity) consequences of
some of those shortcuts ... like assuming, for example, that no
characters need more than 16 bits.
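
To make the "more than 16 bits" point concrete, here is a minimal sketch
of what correct surrogate handling involves.  The helper names
utf16_decode() and utf8_encode() are made up for illustration; they are
not existing kernel routines, and the error convention (-1) is an
assumption.  A code point above U+FFFF takes two 16-bit units in UTF-16
and four bytes in UTF-8, and an unpaired surrogate is rejected rather
than passed through.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one code point from n UTF-16 code units (already in native
 * byte order).  Returns the number of units consumed (1 or 2), or -1
 * for an unpaired surrogate or a truncated pair.
 * Hypothetical helper, for illustration only.
 */
static int utf16_decode(const uint16_t *s, size_t n, uint32_t *cp)
{
	if (n == 0)
		return -1;
	if (s[0] < 0xD800 || s[0] > 0xDFFF) {	/* ordinary BMP character */
		*cp = s[0];
		return 1;
	}
	/* A pair is a high surrogate followed by a low surrogate. */
	if (s[0] > 0xDBFF || n < 2 || s[1] < 0xDC00 || s[1] > 0xDFFF)
		return -1;
	*cp = 0x10000 + (((uint32_t)(s[0] - 0xD800) << 10) |
			 (uint32_t)(s[1] - 0xDC00));
	return 2;
}

/*
 * Encode one code point as UTF-8 into out[], which needs room for
 * 4 bytes.  Returns the byte count (1..4), or -1 for surrogate values
 * and anything above U+10FFFF.
 */
static int utf8_encode(uint32_t cp, unsigned char *out)
{
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return -1;
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp > 0x10FFFF)
		return -1;
	out[0] = (unsigned char)(0xF0 | (cp >> 18));
	out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (unsigned char)(0x80 | (cp & 0x3F));
	return 4;
}
```

For example, the units 0xD83D 0xDE00 decode to U+1F600, which encodes
as the four bytes F0 9F 98 80; a lone 0xDC00 is refused.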

The problem with characters in the astral planes is, right now, mostly
that they're an attack vector because of buggy implementations ... there
aren't many real users of those characters.  Rejecting surrogates would
be a **MUCH** better strategy than mis-handling them.

> So the situation is a mess.  I'm not even sure what features a library
> ought to provide.  Something from the following selection:
>
>	Store output in a buffer or send it to an output routine.

My answer would be to adopt C strings in UTF-8 as the standard
representation.  Output logic could translate a few characters at a time
and send them along, if for some reason it's got to take UTF-16 (or raw
Unicode code points) as input.

>	Native byte-order, little-endian, or big-endian.

Native order is IMO not much use, unless you're trying to use Unicode
native; it's best as an external representation.  At least in kernel
terms.

>	Return an error for invalid codes or ignore them.

Error is safest... but remember, even with Unicode it's going to be
"invalid strings".

But also: if you're given a filesystem with names stored in Unicode,
unpaired surrogates are errors I'd expect FSCK would handle ... but
preventing access to those files would likely be a Bad Thing.

>	Stop at NUL or take a length argument.

Both options are needed.

> Any suggestions for the best way to organize all this?

As I suggested before:  core transcoding routines which are actually
correct, and have options for those features above.  Then provide simple
wrappers for primary option combinations.

- Dave
_______________________________________________
linux-usb-devel@lists.sourceforge.net
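
The organization suggested in the message (one correct core routine that
takes options, plus thin wrappers for the common combinations) might be
sketched roughly as follows.  Everything here is hypothetical: the XC_*
flag names, the function names, and the -1 error convention are
illustrative assumptions, not an existing kernel interface.  The flags
cover byte order, stopping at NUL versus relying on the length alone,
and failing on unpaired surrogates versus substituting U+FFFD.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical option flags; the names are illustrative only. */
#define XC_BIG_ENDIAN	0x1	/* input is UTF-16BE, not UTF-16LE */
#define XC_STOP_AT_NUL	0x2	/* stop at a U+0000 unit, not just at the length */
#define XC_REPLACE_BAD	0x4	/* emit U+FFFD for bad surrogates instead of failing */

/* Fetch one 16-bit code unit from a byte buffer in the requested order. */
static uint16_t get_unit(const uint8_t *p, unsigned flags)
{
	return (flags & XC_BIG_ENDIAN) ? (uint16_t)(p[0] << 8 | p[1])
				       : (uint16_t)(p[1] << 8 | p[0]);
}

/* Append one code point as UTF-8; returns bytes written, 0 if no room. */
static size_t put_utf8(uint32_t cp, char *dst, size_t room)
{
	static const uint8_t lead[] = { 0, 0x00, 0xC0, 0xE0, 0xF0 };
	size_t len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
	size_t i;

	if (len > room)
		return 0;
	for (i = len - 1; i > 0; i--) {		/* continuation bytes, last first */
		dst[i] = (char)(0x80 | (cp & 0x3F));
		cp >>= 6;
	}
	dst[0] = (char)(lead[len] | cp);
	return len;
}

/*
 * Core routine: convert src_units UTF-16 code units at src to UTF-8.
 * Returns the number of bytes written to dst (not NUL-terminated), or
 * -1 on an unpaired surrogate (unless XC_REPLACE_BAD) or lack of room.
 */
static long utf16_to_utf8(const uint8_t *src, size_t src_units,
			  unsigned flags, char *dst, size_t dst_size)
{
	size_t i = 0, out = 0, n;

	while (i < src_units) {
		uint32_t cp = get_unit(src + 2 * i, flags);

		i++;
		if (cp == 0 && (flags & XC_STOP_AT_NUL))
			break;
		if (cp >= 0xD800 && cp <= 0xDFFF) {
			/* Only a high surrogate may start a pair. */
			uint32_t lo = (cp <= 0xDBFF && i < src_units) ?
				      get_unit(src + 2 * i, flags) : 0;

			if (lo >= 0xDC00 && lo <= 0xDFFF) {
				cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
				i++;
			} else if (flags & XC_REPLACE_BAD) {
				cp = 0xFFFD;
			} else {
				return -1;
			}
		}
		n = put_utf8(cp, dst + out, dst_size - out);
		if (n == 0)
			return -1;
		out += n;
	}
	return (long)out;
}

/* Wrapper: counted UTF-16LE input, strict about invalid sequences. */
static long utf16le_to_utf8(const uint8_t *src, size_t units,
			    char *dst, size_t size)
{
	return utf16_to_utf8(src, units, 0, dst, size);
}

/* Wrapper: NUL-terminated UTF-16LE input, bad surrogates become U+FFFD. */
static long utf16le_str_to_utf8_lossy(const uint8_t *src, size_t max_units,
				      char *dst, size_t size)
{
	return utf16_to_utf8(src, max_units,
			     XC_STOP_AT_NUL | XC_REPLACE_BAD, dst, size);
}
```

The strict wrapper suits input that should simply be refused when it is
malformed; the lossy one is closer to the filesystem case in the
message, where a damaged name should still be reachable.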