Earl Hood <[EMAIL PROTECTED]> writes:
>Take the following code snippet:
>
>  use Encode q(:all);
>  print $Encode::VERSION, "\n";
>
>  my $org = '';
>  for my $i (0x20..0xFF){
>    $org .= chr($i);
>  }
>  my $src = $org;
>  print "\nASCII -> UTF8\n";
>  from_to($src, 'ascii', 'utf8', FB_XMLCREF);
>  print $src, "\n";
>
>Prints out the following:
>
>  1.83
>
>  ASCII -> UTF8
>   !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
>  abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87
>  \x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97
>
>After some further hacking, I notices that the success of the
>FB_XMLCREF constant is not consistent. I add the following to the
>script above:
>
>  my $src = $org;
>  print "\nISO-8859-3 -> ISO-8859-8\n";
>  from_to($src, 'iso-8859-3', 'iso-8859-8', FB_XMLCREF);
>  print $src, "\n";
>
>
>  ISO-8859-3 -> ISO-8859-8
>   !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
>  abcdefghijklmnopqrstuvwxyz{|}~
>  Ħ˘£¤\xA5Ĥ§¨
>
>Any insights to this behavior will be appreciated.

from_to is implemented by translating the 'from' source to Unicode, and
then Unicode to the 'to' destination. The FB_XMLCREF fallback is applied
on the 'to' side.

Your original code suffers from fallbacks occurring on the 'from' side:
0x80..0xFF are not ASCII. When you use an 8-bit encoding like iso8859-3
as the source, you don't see the problem.

The behaviour is (almost) by design - i.e. it happened that way and I
decided it made a kind of sense. Using ASCII is considered as asking for
7-bit-ness. If you want one of the 8-bit supersets, use the one you want
(iso8859-1 aka latin1 most likely, but perhaps one of the Windows ones
with smart quotes, em dash etc.).

There is a good case for a "latin-guess" or latin-superset or ... which
tries to do the right thing.

-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/
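
A minimal sketch of the two steps described above, written out with Encode's decode and encode rather than from_to. It assumes the input octets are meant as ISO-8859-1 (an assumption for illustration only; pick whichever 8-bit encoding your data is really in) and that the target is ASCII, so the FB_XMLCREF fallback has something to do on the 'to' side:

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_XMLCREF);

# Four octets in the 0xA0..0xA3 range, assumed here to be ISO-8859-1.
my $octets = join '', map { chr } 0xA0 .. 0xA3;

# Step 1 ('from' side): octets -> Unicode characters.
# Every octet is valid ISO-8859-1, so no fallback is needed here.
my $chars = decode('iso-8859-1', $octets);

# Step 2 ('to' side): Unicode characters -> octets.
# ASCII cannot represent these characters, so FB_XMLCREF replaces
# each one with an XML numeric character reference.
my $ascii = encode('ascii', $chars, FB_XMLCREF);

print $ascii, "\n";
```

With 'ascii' as the source encoding instead, the same octets would already fail in step 1, before FB_XMLCREF is consulted.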