Egmont Koblinger wrote on 2007-05-31 18:35 UTC:
> > How can I then switch between a "byte string" and a "character string"
>
> I guess you're looking for Encode::_utf8_{on,off}
Looks good, but I can't get this to work either:
#!/usr/bin/perl
use Encode;
$s = pack("C2", 0xc2, 0xa9); # binary string containing COPYRIGHT SIGN
print "length=", length($s),"\n"; # gives 2
print "utf8=", Encode::is_utf8($s),"\n"; # gives false
# Convert non-ASCII UTF-8 into XML numeric character reference
$s =~ s/([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})/Encode::_utf8_on($1),sprintf("&#x%02X;", ord($1))/ge;
print "$s\n"; # we want to see here: &#xA9;
$ ./test.pl
length=2
utf8=
&#xC2;
Is there something special about $1 inside a s/.../.../ge expression
that prevents the application of Encode::_utf8_on($1)?
Seems so, since
$s =~ s/([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})/$a = $1,Encode::_utf8_on($a),sprintf("&#x%02X;", ord($a))/ge;
does the trick.
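
A possible explanation, offered as a guess rather than a confirmed diagnosis: $1 is a magical variable whose value appears to be re-fetched from the match each time it is read, so a flag set on it by Encode::_utf8_on($1) may be gone again by the time ord($1) runs. Copying into an ordinary variable avoids that. The sketch below takes the same copy-first approach but uses Encode::decode instead of flipping the internal flag. The helper names ncr and utf8_to_ncr are made up for illustration, and the code assumes the matched bytes are well-formed UTF-8:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn one UTF-8 byte sequence into an XML
# numeric character reference. The argument is copied out of @_
# first, so the magical $1 itself is never modified.
sub ncr {
    my ($seq) = @_;
    my $char = Encode::decode("UTF-8", $seq);   # bytes -> character string
    return sprintf("&#x%02X;", ord($char));
}

# Hypothetical helper: replace every 2-, 3- or 4-byte UTF-8 sequence
# in a byte string with its numeric character reference.
sub utf8_to_ncr {
    my ($bytes) = @_;
    $bytes =~ s/([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})/ncr($1)/ge;
    return $bytes;
}

my $s = pack("C2", 0xc2, 0xa9);   # binary string containing COPYRIGHT SIGN
print utf8_to_ncr($s), "\n";      # prints &#xA9;
```

The regular expression is unchanged from the test script above, so it still accepts some byte sequences that are not valid UTF-8 (overlong forms, for instance). What Encode::decode returns for those depends on its CHECK argument, which is left at its default here.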
Markus
--
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/