tags 418058 + unreproducible
thanks

Osamu Aoki wrote:
> Package: libc6
> Version: 2.3.6.ds1-13
> Severity: important
>
> Problem: ~ ' \ conversion.
>
> In short, iconv should not do smart guessing for the 7-bit section of
> the traditional encodings that are ASCII compatible. That 7-bit section
> should be the same in all of them.
>
> Here we go....
>
> All popular C/perl/shell/... programs originally written in latin-1,
> latin-2, ..., shift-jis, euc-jp, ... encodings will break if iconv is
> used to convert them to UTF-8. iconv does a half-smart job to please some
> cosmetic factors but forgets how these encodings were originally
> developed and used in real life, so it is harmful to the data. (Of
> course those funny 8-bit texts are in the comments and the quoted text.)
>
> In this sense, I could file a grave bug for breaking data, but considering
> the timing, I will stay with important. (After etch, I may raise this bug's
> severity.)
>
> All these encodings (latin-1, latin-2, ..., shift-jis, euc-jp, ...)
> were developed so that non-ASCII characters can be expressed without
> breaking existing tools and code developed for ASCII. That is why they are
> ASCII compatible. All characters represented by 0x00-0x7f (7-bit) share
> the same position. (We do use an alternative font for ASCII 0x5c =
> backslash = '\' in Japan, which looks like the Japanese yen mark, but \ in
> ASCII and yen in shift-jis serve the same purpose in the programming
> world. The C standard even mentions the dual nature of \.) So when
> changing the encoding of a file, we expect all of 0x00-0x7f (7-bit) to
> remain the same.
>
> But iconv does many funny things.
>
> The code 0x27 (single quote) is changed to something else (a long UTF-8
> sequence for a single quote) when converting from any of latin-1, latin-2,
> shift-jis, euc-jp, ... to UTF-8. This is not expected.
>
> For shift-jis, it is even worse. iconv tries to map character 0x5c to
> the UTF-8 YEN mark. That mapping should be done for the yen mark code in
> the 16-bit (full-width character) section and not for this 7-bit one. This
> is very bad for any program. Another issue is 0x7e '~'. This is
> translated to an upper bar. Although some old Japanese PCs (pre-IBM
> compatible, NEC 98 machines, I think) had an upper-bar-shaped glyph for ~,
> converting this ~ in data to the UTF-8 upper bar breaks URL data stored on
> shift-jis machines.
>
> The choice of conversion table should not be based on a superficial shape
> comparison but should take full account of actual usage and its
> implications.
>
> iconv being a basic tool, it should not do these conversions on the 7-bit
> codes of these encodings. If anyone wants a syntactical pretty-print
> conversion of UTF-8 text, they should rely on some other tool. Then they
> can use opening and closing quotes if they wish. But we can keep C programs
> right. Many old C programs in each locale used these ASCII-compatible
> encodings, and all we want to do is convert quoted text and comments to
> UTF-8.

All the diffs you provided are actually wrong. In all those files, the input
character for ' is not 27 but E2 80 99, which is a UTF-8 sequence. iconv
behaves correctly here.

Please provide us with a correct input file (check it with hexdump) that
exhibits the problem. I suggest gzipping it to avoid encoding translation by
your MUA.

-- 
  .''`.  Aurelien Jarno             | GPG: 1024D/F1BCDB73
 : :' :  Debian developer           | Electrical Engineer
 `. `'   [EMAIL PROTECTED]          | [EMAIL PROTECTED]
   `-    people.debian.org/~aurel32 | www.aurel32.net
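
The byte-level check requested above can also be sketched in Python, for anyone without hexdump at hand. This is only an illustration under stated assumptions: Python's built-in codecs stand in for iconv, the helper name `non_ascii_runs` is made up for this example, and the two sample inputs are hypothetical, not the files attached to the report.

```python
# Sketch: look at the raw input bytes before blaming the converter.
# Assumption: Python's codecs are used here in place of iconv.

def non_ascii_runs(data: bytes):
    """Return (offset, bytes) pairs for each run of bytes >= 0x80."""
    runs = []
    start = None
    for i, b in enumerate(data):
        if b >= 0x80:
            if start is None:
                start = i          # a non-ASCII run begins here
        elif start is not None:
            runs.append((start, data[start:i]))
            start = None
    if start is not None:          # run extends to the end of the data
        runs.append((start, data[start:]))
    return runs

# Hypothetical sample inputs, not the reporter's files.
ascii_quote = b"char c = '\\n';"            # apostrophe stored as 0x27
curly_quote = "don\u2019t".encode("utf-8")  # apostrophe stored as E2 80 99

# A plain 0x27 apostrophe is unchanged by a Latin-1 -> UTF-8 conversion.
converted = ascii_quote.decode("latin-1").encode("utf-8")
print(converted == ascii_quote)             # True

# The curly quote is already a three-byte UTF-8 sequence in the input.
for offset, run in non_ascii_runs(curly_quote):
    print(offset, run.hex(" "))             # 3 e2 80 99
```

If the second loop prints anything for a file that is supposed to be pure 7-bit in the quoted positions, the input was already altered before the conversion step.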

