Hi Sadahiro,

After incorporating the changes in doop.c and op.c, I get a lot of failures; the test results are below. The first approach itself seems to fail on tr//, and there will certainly be more failures when we run the entire test suite, which uses these functions. In the second approach, the change seems to affect only tr//. Please let me know which changes you suggest for S_scan_const(), so that I can apply them and see whether they work.

regards
Sastry

# Failed at t/op/tr.t line 110
# got 'š\''
Wide character in print at ./test.pl line 48.
# expected 'Œã\''
# Failed at t/op/tr.t line 209
Wide character in print at ./test.pl line 48.
# got '¯œD–㯜D–ã'
Wide character in print at ./test.pl line 48.
# expected '¯œ¯Û–㯜¯Û–ã'
# Failed at t/op/tr.t line 219
# got 'CDÚCDÚ'
Wide character in print at ./test.pl line 48.
# expected 'C¯Û–ãC¯Û–ã'
# Failed at t/op/tr.t line 224
Wide character in print at ./test.pl line 48.
# got 'ED–ãED–㌨Føã'
Wide character in print at ./test.pl line 48.
# expected 'E¯Û[E¯Û[Œ¨Føã'
# Failed at t/op/tr.t line 234
Wide character in print at ./test.pl line 48.
# got '¯Û¯Û¯Û¯Û¯Û¯Û'
Wide character in print at ./test.pl line 48.
# expected '¯ÛD¯Û¯ÛD¯Û'
# Failed at t/op/tr.t line 283
Wide character in print at ./test.pl line 48.
# got '¯œD–㯥E–ã'
Wide character in print at ./test.pl line 48.
# expected '¯œ¯œ–㯥¯Û–ã'
# Failed at t/op/tr.t line 350
# got '§ÿ'
Wide character in print at ./test.pl line 48.
# expected 'ΰÎ"'
1..99
ok 1 - uc
ok 2 - lc
ok 3 - partial uc
ok 4 - EBCDIC discontinuity
ok 5 - tr cancels IOK and NOK
ok 6 - harmless if explicitly not updating
ok 7 - harmless if implicitly not updating
ok 8 - no error
ok 9 - handles UTF8
ok 10
ok 11
ok 12
ok 13
ok 14
ok 15
ok 16
ok 17 - changing UTF8 chars in a UTF8 string, same length
ok 18
ok 19 - more bytes
ok 20
not ok 21 - Putting UT8 chars into a non-UTF8 string
ok 22
ok 23 - Removing UTF8 chars from UTF8 string
ok 24
ok 25 - Counting UTF8 chars in UTF8 string
ok 26 - non-UTF8 chars in UTF8 string
ok 27 - UTF8 chars in non-UTFs string
ok 28 - tr/a-z-9//
ok 29 - hyphens, leading
ok 30 - trailing
ok 31 - both
ok 32
ok 33
ok 34
ok 35 - reversed range check
ok 36 - cannot update read-only var
ok 37 - explicit read-only count
ok 38 - no error
ok 39 - implicit read-only count
ok 40 - no error
ok 41 - LHS of non-updating tr
ok 42 - LHS bad on updating tr
ok 43 - byte2byte transliteration
ok 44
ok 45
ok 46
not ok 47 - byte2wide transliteration
ok 48 - wide2byte
ok 49 - wide2wide
not ok 50 - byte2wide & wide2byte
not ok 51 - all together now!
ok 52 - transliterate and count
ok 53
not ok 54 - translit w/complement
ok 55
ok 56 - translit w/deletion
ok 57
ok 58 - translit w/squeeze
ok 59
ok 60
ok 61
ok 62
ok 63 - UTF range
not ok 64
ok 65
ok 66
ok 67
ok 68
ok 69
ok 70
ok 71
ok 72
ok 73
ok 74
ok 75
ok 76
ok 77
ok 78
ok 79
ok 80
ok 81
ok 82
not ok 83
ok 84
ok 85
ok 86
ok 87
ok 88 - pp_trans needs to unshare shared hash keys
ok 89 - no error
ok 90 - implicit count on constant
ok 91 - no error
ok 92 - implicit count outside array bounds, index negative
ok 93 - doesn't extend the array
ok 94 - implicit count outside array bounds, index positive
ok 95 - doesn't extend the array
ok 96 - implicit count outside hash bounds
ok 97 - doesn't extend the hash
ok 98 - non-modifying tr/// on a scalar ref
ok 99 - doesn't stringify its argument

On 9/14/05, SADAHIRO Tomoyuki <[EMAIL PROTECTED]> wrote:
>
> On Wed, 14 Sep 2005 16:50:26 +0530, Sastry <[EMAIL PROTECTED]> wrote
>
> > Hi Sadahiro
> >
> > On 9/12/05, SADAHIRO Tomoyuki <[EMAIL PROTECTED]> wrote:
> > >
> > > I attribute the failure in tr/\x{12c}-\x{130}/\xc0-\xc4/; to
> > > such an ambiguity of \xc0-\xc4. In this expression the left part
> > > \x{12c}-\x{130} parsed before coerces \xc0-\xc4 into Unicode,
> > > and results in the failure.
> > So this is still a problem on EBCDIC! Is there a way to fix this?
> >
> > > #test case B # On ASCII platform, of course successful
> > > $c = ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x{100}\x89-\x91/X/;
> > > is($c, 8);
> > > is($a, "XXXXXXXX");
> > This test fails on EBCDIC. In S_scan_const(), there is a statement below.
> >     /* Insert oct or hex escaped character.
> >      * There will always enough room in sv since such
> >      * escapes will be longer than any UTF-8 sequence
> >      * they can end up as. */
> >
> >     /* We need to map to chars to ASCII before doing the tests
> >        to cover EBCDIC
> >     */
> >     if (!UNI_IS_INVARIANT(NATIVE_TO_UNI(uv))) {
> >         if (!has_utf8 && uv > 255) {
> >
> > on an ASCII , the first if condition is true as uv is 137 and it
> > falls in the variant range as uv >\x7F whereas on EBCDIC the if
> > condition is false. Can you explain why this behaviour is?
>
> see "else" for this "if." This condition tests whether uv needs
> multiple octets in UTF-8/UTF-EBCDIC or only needs a single octet.
> "\x89" in Latin-1 corresponds to a double-octet representation
> in UTF-8, and true (that needs multiple octets) on ASCII platform.
> "\x89" in EBCDIC corresponds to a single-octet representation
> in UTF-EBCDIC, and false on EBCDIC platform.
>
> Where "else" runs, there is no difference between ASCII and UTF-8;
> or between single-octet EBCDIC and UTF-EBCDIC.
>
> > Also I found that the characters are expanded during runtime in
> > S_do_trans_simple_utf8()
>
> If I understand it correctly, expansion of character ranges isn't
> performed in do_trans_simple_utf8(). It is performed in scan_const()
> for non-Unicode and pmtrans() for Unicode.
>
> > Do you have any suggestion where the problem is?
>
> (1) one way (I think worse)
> Perl should treat the range in the native order (not in Unicode one)
> through the parse time, the compile time, and the run time.
>
> using uvchr_to_utf8() instead of uvuni_to_utf8(),
> utf8n_to_uvchr() instead of utf8n_to_uvuni(),
> in op.c#pmtrans and doop.c#do_trans_simple_utf8 etc.
>
> But swash_fetch() also needs change (the current swash does not
> know EBCDIC, only Unicode); changes of swash may lead to
> corruption of lc(), uc(), regular expression \p{something} etc.
>
> (2) another way (I think better)
> No change of swash, pmtrans, do_trans_****.
>
> Then all character ranges within 0..255 (not only for non-Unicode
> but also for Unicode) to be expanded in scan_const().
> (and pmtrans() will expand only uv >= 256).
>
> I think this way requires only the change of toke.c#scan_const
> and influences only tr///.
>
> But the change will be quite big, since the current scan_const()
> only expands non-Unicode and assumes a single octet encoding.
> The range 0..255 in UTF-8/UTF-EBCDIC includes double-octet characters.
>
> I'm not sure whether such a change should be enclosed
> with #ifdef EBCDIC and #endif
>
> Regards,
> SADAHIRO Tomoyuki
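
To make the point about the "invariant" test in the quoted reply concrete, here is a small standalone C sketch of the ASCII-platform case only. It is not Perl's UNI_IS_INVARIANT macro; is_utf8_invariant() and encode_utf8_small() are hypothetical names used for illustration. It shows that 0x89 (137) needs two octets in UTF-8, which is why the first "if" is true on an ASCII platform. The EBCDIC case, where the reply says "\x89" stays a single octet in UTF-EBCDIC, would need the NATIVE_TO_UNI mapping tables and is not modelled here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not Perl's macro: under plain UTF-8 a code point
 * is "invariant" (encoded as the same single octet) only in 0x00..0x7F. */
static int is_utf8_invariant(unsigned long cp)
{
    return cp < 0x80;
}

/* Encode cp as UTF-8 into out. This sketch only handles code points up
 * to 0x7FF (one or two octets). Returns the number of octets written,
 * or 0 if cp is outside that range. */
static size_t encode_utf8_small(unsigned long cp, unsigned char out[2])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));   /* lead octet */
        out[1] = (unsigned char)(0x80 | (cp & 0x3F)); /* continuation */
        return 2;
    }
    return 0;
}
```

Calling encode_utf8_small(0x89, buf) writes the two octets 0xC2 0x89, while a code point such as 0x41 is written unchanged as a single octet.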
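
Option (2) in the quoted reply splits the work by code point: ranges within 0..255 are expanded early, and only the part at 256 and above is left for later. The following C sketch illustrates that split under stated assumptions. It is not code from toke.c or op.c; range_split and expand_low_range() are hypothetical names, and the values are treated as native code points, so the expansion follows native order before any conversion to Unicode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch. */
typedef struct {
    size_t n_expanded;      /* code points written to out */
    int has_rest;           /* nonzero if part of the range is >= 256 */
    unsigned long rest_lo;  /* first code point left unexpanded */
    unsigned long rest_hi;  /* last code point left unexpanded */
} range_split;

/* Expand the part of lo..hi that lies within 0..255 into out (the step
 * the reply proposes for scan_const()) and report any remainder >= 256
 * (the step the reply leaves to pmtrans()). */
static range_split expand_low_range(unsigned long lo, unsigned long hi,
                                    unsigned long *out, size_t out_cap)
{
    range_split r = {0, 0, 0, 0};
    unsigned long cp;

    assert(lo <= hi);
    assert(out != NULL);

    for (cp = lo; cp <= hi && cp <= 255; cp++) {
        assert(r.n_expanded < out_cap);
        out[r.n_expanded++] = cp;
    }
    if (hi >= 256) {
        r.has_rest = 1;
        r.rest_lo = lo > 256 ? lo : 256;
        r.rest_hi = hi;
    }
    return r;
}
```

For the range \x89-\x91 from test case B, this yields nine explicit code points and no remainder; a range such as 0xFE..0x101 yields two expanded code points plus a remainder 0x100..0x101.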