On Sat, Dec 15, 2001 at 01:56:19AM -0500, Michael G Schwern wrote:
> On Sat, Dec 15, 2001 at 02:41:13AM -0500, Barrie Slaymaker wrote:
> > There's also a test script for the escaping now :).  I'd love to test
> > characters for codepoints > 0xff, but they're borken / unsupported in
> > all released perls, I think.
> 
> Broken is such an ugly word.  "Experimental" :)

:)

> 5.6.1's unicode implementation should be usable enough for your
> purposes.  You might want to have a file full of Unicode tests that
> only runs if you've got a Perl >= 5.6.1

Well, when trying to escape chr( 0x0100 ) and chr( 0xfffd ) without
"use utf8", I get:

   not ok 12
   # Test 12 got: '\xc4\x80' (t/00escape.t at line 23)
   #    Expected: '\x{0100}'
   not ok 13
   # Test 13 got: '\xef\xbf\xbd' (t/00escape.t at line 24)
   #    Expected: '\x{fffd}'

When I "use utf8" in Differences.pm and try to escape chr( 0xff ), I get:

   ok 10
   Malformed UTF-8 character (byte 0xff) in substitution (s///) at
   blib/lib/Test/Differences.pm line 216.
   Malformed UTF-8 character (byte 0xff) in ord at
   blib/lib/Test/Differences.pm line 216.
   Segmentation fault (core dumped)

The core dump is speaking to me.  When I channel it, a low, demented
voice (well, more demented than normal) comes from me saying "DON'T USE
UTF8 HERE".  So I just don't test for that and I don't "use utf8".  I
haven't had time to poke and prod at it to see if I can work around it.

Here's a small demo of the "Malformed" error on chr(0xff) from test 11:

   $ perl -le '          print join ",", map sprintf( "%02x", ord), chr( 0xff ) =~ /([\000])/'

   $ perl -le 'use utf8; print join ",", map sprintf( "%02x", ord), chr( 0xff ) =~ /([\000])/'
   Malformed UTF-8 character (byte 0xff) in pattern match (m//) at -e line 1.

It really seems to be matching against


And here's a one liner for the core dump:

   $ perl -le 'use utf8; ( $s = chr( 0xff )) =~ s/([^\000])/sprintf "f"/e'
   Malformed UTF-8 character (byte 0xff) in substitution (s///) at -e line 216.
   Segmentation fault (core dumped)

Throwing a split at it seems to be better than a regexp if I use utf8:

   $ perl -le 'use utf8; print join ",", map sprintf( "%02x", ord), ( split //, chr( 0xff ) . chr( 0x100 ) . chr( 0xfffd ) . chr( 0xffff ) )'
   Malformed UTF-8 character (character 0xffff) in ord at -e line 1.
   ff,100,fffd,00

but horks if I don't:

   $ perl -le 'print join ",", map sprintf( "%02x", ord), ( split //, chr( 0xff ) . chr( 0x100 ) . chr( 0xfffd ) . chr( 0xffff ) )'
   Malformed UTF-8 character (unexpected non-continuation byte 0x00 after start byte 0xc3) in ord at -e line 1.
   Malformed UTF-8 character (unexpected continuation byte 0xbf) in ord at -e line 1.
   Malformed UTF-8 character (unexpected non-continuation byte 0x00 after start byte 0xc4) in ord at -e line 1.
   Malformed UTF-8 character (unexpected continuation byte 0x80) in ord at -e line 1.
   Malformed UTF-8 character (unexpected non-continuation byte 0x00 after start byte 0xef) in ord at -e line 1.
   Malformed UTF-8 character (unexpected continuation byte 0xbf) in ord at -e line 1.
   Malformed UTF-8 character (unexpected continuation byte 0xbd) in ord at -e line 1.
   Malformed UTF-8 character (unexpected non-continuation byte 0x00 after start byte 0xef) in ord at -e line 1.
   Malformed UTF-8 character (unexpected continuation byte 0xbf) in ord at -e line 1.
   Malformed UTF-8 character (unexpected continuation byte 0xbf) in ord at -e line 1.
   00,00,00,00,00,00,00,00,00,00

I'll need to dig into this a bit more with unpack and friends, I guess,
unless somebody who knows and loves 5.6's utf8-ness can see some happy
place I'm missing.
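
For comparison, here is a rough sketch of the two views of such a string
(code points versus UTF-8 octets). It assumes a perl with settled Unicode
string semantics (5.8 or later), not the 5.6.1 behaviour shown above, and
the helper names codepoints_hex and utf8_bytes_hex are made up for
illustration:

```perl
use strict;
use warnings;

# Code points of a string, one per character, as lowercase hex.
sub codepoints_hex {
    my ($s) = @_;
    return join ",", map { sprintf "%02x", ord } split //, $s;
}

# UTF-8 octets of the same string, as lowercase hex.
# Assumption: utf8::encode is available (it is built in on 5.8+).
sub utf8_bytes_hex {
    my ($s) = @_;
    my $octets = $s;          # copy, so the caller's string is untouched
    utf8::encode($octets);    # characters -> UTF-8 octets, in place
    return join ",", map { sprintf "%02x", $_ } unpack "C*", $octets;
}

my $s = chr(0xff) . chr(0x100) . chr(0xfffd);
print codepoints_hex($s), "\n";    # ff,100,fffd
print utf8_bytes_hex($s), "\n";    # c3,bf,c4,80,ef,bf,bd
```

The octet view shows the same c3/bf, c4/80 and ef/bf/bd bytes that turn
up in the failed tests and in the "Malformed" warnings.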

For reference, the regex causing the grief in the first place is:

   sub _escape($) {
       my $s = shift ;
       $s =~ s{([^\040-\177])}{
           exists $escapes{$1}
               ? $escapes{$1}
               : sprintf( "\\x{%04x}", ord $1 ) ;
       }ge;

       $s;
   }

If anyone can see a good way either to walk this string byte by byte on
older perls or to get the automatic adjustments to utf8 mode working on
5.6.1, I'll be quite grateful.
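
One possible shape for the "walk the string" idea, sketched without the
s///e that triggers the trouble. This is untested on 5.6.1 and assumes a
perl where split // and ord agree on characters (5.8 or later); the name
escape_chars and the contents of %escapes here are hypothetical stand-ins,
not the module's real table:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the module's %escapes table.
my %escapes = ( "\n" => '\n', "\t" => '\t' );

# Escape everything outside \040-\177, one character at a time,
# using split and ord instead of a substitution.
sub escape_chars {
    my ($s) = @_;
    return join "", map {
        my $cp = ord;
        exists $escapes{$_}          ? $escapes{$_}
      : $cp >= 0x20 && $cp <= 0x7f  ? $_
      :                                sprintf( "\\x{%04x}", $cp );
    } split //, $s;
}

print escape_chars( "a\n" . chr(0xff) . chr(0x100) ), "\n";
# prints: a\n\x{00ff}\x{0100}
```

It keeps the same character range and the same \x{....} format as the
regex version, so the existing tests should apply unchanged.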

- Barrie
