On Fri, Feb 16, 2007 at 05:23:01PM +0100, Thomas Moschny wrote:
> On Freitag, 16. Februar 2007, Lapo Luchini wrote:
> > on Fedora, with libiconv bundled inside libc:
> > % echo "\xE3\x83\x9D" | iconv -f UTF-8 -t ASCII//IGNORE//TRANSLIT
> > iconv: illegal input sequence at position 4
> > % echo "\xE3\x83\x9D" | iconv -f UTF-8 -t ASCII//IGNORE
> > iconv: illegal input sequence at position 4
> > % echo "\xE3\x83\x9D" | iconv -f UTF-8 -t ASCII//TRANSLIT
> > ?
> > Order of the modifiers seems to matter.
>
> On Fedora Core 6:
> % echo "\xE3\x83\x9D" | iconv -f UTF-8 -t ASCII//TRANSLIT//IGNORE
> ?

There's something seriously odd going on with //IGNORE as well.
Notice the "position 4" there. On FC1, I get:

  fc1$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//IGNORE
  ab
  iconv: illegal input sequence at position 6

i.e., it seems to actually translate everything correctly, then throw
a bogus error upon reaching end-of-string.

For completeness:

  fc1$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//TRANSLIT
  ?ab
  fc1$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//TRANSLIT//IGNORE
  ?ab
  fc1$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//IGNORE//TRANSLIT
  ab
  iconv: illegal input sequence at position 6

So in all the //foo//bar cases, it actually acts like the second //bar
isn't even there.

  fc1$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII
  iconv: illegal input sequence at position 0
  fc1$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//IGNORE,TRANSLIT
  iconv: illegal input sequence at position 0
  fc1$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//TRANSLIT,IGNORE
  iconv: illegal input sequence at position 0

With comma, the outward behavior is the same as if the //foo isn't
there at _all_... given that the iconv manual actually just documents
that you can use //IGNORE or //TRANSLIT, it's possible that once upon
a time there was no comma parsing at all? Dunno.

It doesn't give an error on other unrecognized modifiers, either:

  fc1$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//ASDF
  iconv: illegal input sequence at position 0

On mostly-current debian sid, the comma stuff and TRANSLIT seem to
work:

  sid$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII
  iconv: illegal input sequence at position 0
  sid$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//TRANSLIT
  ?ab
  sid$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//TRANSLIT,IGNORE
  ?ab
  sid$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//IGNORE,TRANSLIT
  ?ab

But the weird //IGNORE error is still there:

  sid$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//IGNORE
  ab
  iconv: illegal input sequence at position 6

Not sure if this is a bug, or just something odd in the iconv command
line tool -- perhaps it is perfectly expected that if you use
//IGNORE, iconv will work correctly and then set errno to something to
say "hey, I totally had errors that I ignored, just so you know".

Again, if you use //foo//bar, then it acts the same as if you had only
passed //foo and left off //bar:

  sid$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//IGNORE//TRANSLIT
  ab
  iconv: illegal input sequence at position 6
  sid$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//TRANSLIT//IGNORE
  ?ab

I'm not really sure why it works this way; looking at gconv_open.c in
the glibc sources, AFAICT it should simply fail to understand this
bizarre "TRANSLIT//IGNORE" error handling specification entirely and
ignore it. But my skills at reading string-parsing-in-C code are
pretty rusty.

So, ummm... in conclusion. //IGNORE actually seems like it is working
correctly and usefully, just with an unexpected API. //TRANSLIT works
pretty okay too. But mostly we've only tested with GNU iconv -- I
have no idea what's going to happen on, say, OS X or *BSD or Solaris.

One option is just to write our own "//IGNORE"-style iconv wrapper.
iconv's normal API is that it does as much work as it can, then it
tells you where it bombed out. It's perfectly possible at that point
to skip ahead a byte or more on the input, stick a question mark in
the output string, and then try again from there. Not the most
efficient thing in the world, but probably a lot easier than trying to
ship iconv conversion tables.

-- Nathaniel

-- 
Electrons find their paths in subtle ways.
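
A rough sketch of the skip-and-retry wrapper idea described above,
written against the POSIX iconv(3) API. This is an illustration under
stated assumptions, not tested on non-GNU iconv implementations: the
helper name convert_lossy and its signature are made up for the
example, the caller is assumed to supply a large enough output buffer,
and the "skip continuation bytes" step assumes the input is UTF-8 (so
that one rejected character yields one '?' rather than one per byte).

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert inlen bytes of `in` through `cd` into
 * `out`, writing '?' for anything iconv refuses to convert.
 * Returns the number of bytes written, or (size_t)-1 on a hard error
 * (output buffer too small, or an unexpected errno). */
static size_t convert_lossy(iconv_t cd, const char *in, size_t inlen,
                            char *out, size_t outlen)
{
    assert(in != NULL && out != NULL);

    char *inp = (char *)in;     /* iconv's prototype takes char ** */
    char *outp = out;
    size_t inleft = inlen, outleft = outlen;

    while (inleft > 0) {
        /* iconv converts as much as it can and advances the pointers;
         * on failure inp is left at the offending sequence. */
        if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1)
            break;              /* all remaining input was consumed */

        /* EILSEQ: invalid or unconvertible sequence.
         * EINVAL: truncated sequence at end of input.
         * Anything else (e.g. E2BIG) is treated as fatal here. */
        if (errno != EILSEQ && errno != EINVAL)
            return (size_t)-1;
        if (outleft == 0)
            return (size_t)-1;

        /* Stick a question mark in the output... */
        *outp++ = '?';
        outleft--;

        /* ...skip one input byte, plus any UTF-8 continuation bytes
         * (10xxxxxx) that belong to it. UTF-8 input is an assumption. */
        inp++;
        inleft--;
        while (inleft > 0 && ((unsigned char)*inp & 0xC0) == 0x80) {
            inp++;
            inleft--;
        }

        /* Return the descriptor to its initial shift state, then retry. */
        iconv(cd, NULL, NULL, NULL, NULL);
    }
    return (size_t)(outp - out);
}
```

A caller would open a descriptor with plain iconv_open("ASCII",
"UTF-8"), with no //IGNORE or //TRANSLIT suffix, pass the input and an
output buffer to convert_lossy, and close it with iconv_close
afterwards. Since the suffix parsing is never involved, the result
should not depend on how a given platform's iconv handles those
modifiers.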