Egmont Koblinger wrote on 2007-05-31 18:35 UTC:
> > How can I then switch between a "byte string" and a "character string"
> 
> I guess you're looking for Encode::_utf8_{on,off}

Looks good, but can't get this to work either:

#!/usr/bin/perl
use Encode;
$s = pack("C2", 0xc2, 0xa9); # binary string containing COPYRIGHT SIGN
print "length=", length($s),"\n"; # gives 2
print "utf8=", Encode::is_utf8($s),"\n"; # gives false
# Convert non-ASCII UTF-8 into XML numeric character reference
$s =~ s/([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})/
    Encode::_utf8_on($1), sprintf("&#x%02X;", ord($1))/ge;
print "$s\n"; # we want to see here: &#xA9;

$ ./test.pl
length=2
utf8=
&#xC2;

Is there something special about $1 inside a s/.../.../ge expression
that prevents the application of Encode::_utf8_on($1)?

Seems so, since

$s =~ s/([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})/
    $a = $1, Encode::_utf8_on($a), sprintf("&#x%02X;", ord($a))/ge;

does the trick.
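
A likely explanation, offered as an assumption rather than a confirmed diagnosis: $1 is a magical variable whose value is re-derived from the match on each access, so a flag set on it does not stick, while a copy in an ordinary variable keeps it. The sketch below avoids the internal _utf8_on altogether by decoding a copy of each matched byte sequence with Encode::decode. The helper name utf8_to_ncr and the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Same byte patterns as above: 2-, 3- and 4-byte UTF-8 sequences.
my $utf8_seq = qr/[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}/;

# Hypothetical helper: replace every non-ASCII UTF-8 sequence in a byte
# string with an XML numeric character reference.
sub utf8_to_ncr {
    my ($bytes) = @_;
    # Copy $1 into a lexical first, then decode the copy into a character
    # string; ord() of that one-character string is the code point.
    $bytes =~ s/($utf8_seq)/ my $seq = $1; sprintf("&#x%02X;", ord(Encode::decode("UTF-8", $seq))) /ge;
    return $bytes;
}

print utf8_to_ncr(pack("C2", 0xc2, 0xa9)), "\n"; # prints &#xA9;
```

Unlike _utf8_on, which only flips the internal flag, decode also validates the bytes, so a malformed sequence that happens to match the pattern should come out as U+FFFD instead of a bogus code point.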

Markus

-- 
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
