jhi,

On 2002.02.22, at 07:54, Jarkko Hietaniemi wrote:
> Hi,
>
> the new JP test and the old Tcl test are doing "somewhat okay" in EBCDIC
> (I'm using an OS/390 mainframe).

I wish I had access to it...

> Failed Test            Stat Wstat Total Fail  Failed  List of Failed
> -------------------------------------------------------------------------------
> ...
> ../ext/Encode/t/JP.t    255 65280    22   16  72.73%  7-22
> ../ext/Encode/t/Tcl.t   137 35072   632   34   5.38%  592-598 600 602 604 606
>                                                       608 610 612-632
>
> My problem is what to do about these failures. Especially the Tcl.t
> is rather frustratingly close to success. The JP.t might be a hard
> nut to crack. Should I just skip the failing tests? If so, we need
> to figure out what is the pattern of the failures (hardcoding by test
> numbers would feel really evil...)? We might entertain the idea of
> completely skipping these tests, but the relatively high success rate
> seems to be saying that fixing this instead of ignoring this might be
> possible.

I have yet to grok your test to the fullest extent, but this much I can tell: don't let the high success rate fool you. Remember that the 8-bit part is much smaller than the 16-bit part. If your tests attempt something like "feed UTF-EBCDIC to a given encoding, decode it back and see if it matches the original", chances are most of the iso-8859-1 part is failing. But once again, I have yet to check in full detail.

> Dan, in case EBCDIC scares you (and it should :-), a quick intro:
> basically, consider the whole low 256 characters being rearranged from
> what they are in ASCII. For example, ord("A") is 0xC1, not 0x41. (The
> pod/perlebcdic.pod has the full tables.)

Sure it does scare me. I have to confess UTF-EBCDIC was totally out of my mind. But here I got a hint: like Perl used to be, CJK encodings are very, very ASCII-chauvinistic. Their variable-length encoding relies heavily on the fact that ASCII leaves the MSB of the byte alone. That way you can tell whether a given byte is a whole (half-width) character or half of a full-width character.
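
To make the MSB point concrete, here is a minimal sketch in Python (not Perl, and not how Encode itself works). It assumes EUC-JP as a representative CJK encoding and uses Python's cp037 codec as a stand-in for the EBCDIC code page on OS/390, which may well differ; the helper names `msb_set` and `classify_by_msb` are made up for illustration.

```python
# Sketch only: classify bytes the "ASCII-chauvinistic" way, by looking at
# nothing but the most significant bit (MSB) of each byte.

def msb_set(byte):
    """Return True if the top bit (0x80) of the byte is set."""
    return (byte & 0x80) != 0


def classify_by_msb(data):
    """Label each byte 'single' (MSB clear) or 'multi' (MSB set)."""
    return ["multi" if msb_set(b) else "single" for b in data]


# 'A' followed by one kanji (U+6F22), encoded as EUC-JP.
euc = "A\u6f22".encode("euc_jp")
print(euc.hex(), classify_by_msb(euc))

# The same test applied to 'A' in an EBCDIC code page (cp037 is an
# assumption here, standing in for whatever OS/390 really uses).
ebcdic_a = "A".encode("cp037")
print(hex(ebcdic_a[0]), classify_by_msb(ebcdic_a))
```

With ASCII underneath, the letter is labelled 'single' and both kanji bytes 'multi'. With EBCDIC underneath, a plain 'A' already has its MSB set, so the same test would mistake it for half of a full-width character.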
The shadow of ASCII falls even on ISO-2022, an escape-based encoding that is not supposed to be affected by the MSB and such (only \e was supposed to matter); in ISO-2022, most 2-byte characters are represented by either a 96x96 or a 94x94 grid, which is (7-bit ASCII - control characters) or (that - space (0x20) and DEL (\x7F)). Obviously this will not work on EBCDIC.... This one may be tougher than we think....

FYI, I know something called 12-bit EBCDIC kanji also exists. I know only of its existence, but is that in our support list?

> The test logs are attached: I would really appreciate if you could see
> some pattern in the failures.

I will do the best I can, but I will be away this weekend and I won't be back online until Sunday at least.

> --
> $jhi++; # http://www.iki.fi/jhi/
> # There is this special biologist word we use for 'stable'.
> # It is 'dead'. -- Jack Cohen

Dan the Unstable according to Jack Cohen