At 8:03 pm +0000 9/3/05, [EMAIL PROTECTED] wrote:
here's my perl -V
Summary of my perl5 (revision 5 version 8 subversion 6) configuration:
So ignore anything you've been told about previous versions.
Basically I have xC3 x84 and let perl think it is utf-8. It is valid utf-8, i.e. A with diaeresis.
Yes.
I don't understand what the [UTF8 "\x{c4}"] is telling me. xc4 is not valid utf-8. It is however valid unicode as xc4 is a precomposed char. What's worse is that the output file contains xc4 and not the utf-8 sequence I expected.
The script below will result in two identical files both containing two bytes "\xC3" and "\x84". If you read them raw you will get two characters. If you read them as UTF-8 you will get a single character A with diaeresis. If you read them as UCS-2 you will get the single character HANGUL SYLLABLE SSE. How you read them and how you display them will make no difference to the content of the files.
#!/usr/bin/perl -w
use strict;
binmode STDOUT, "utf8"; # then try omitting this
my $fin = "/tmp/in.txt";
my $fout = "/tmp/out.txt";

# Create a test file to read
open FIN, ">$fin" or die $!;
print FIN "\xC3\x84"; # write two bytes to $fin
close FIN;

# Get the text from $fin
open FIN, "<:raw", $fin ; # then try omitting the ' "<:raw", '
my $text = <FIN>;
close FIN;

# Print $text utf-8 encoded to $fout
open FOUT, ">$fout";
print FOUT $text;
close FOUT;

# Read $fout as UTF-8
open FOUT, "<:utf8", $fout;
$text = <FOUT>;
close FOUT;
print "YES, I AM \\x{00C4}\n" if $text eq "\x{00C4}";
print $text. "....", length $text, $/;

# Read $fout as raw bytes
open FOUT, "<:raw", $fout;
$text = <FOUT>;
close FOUT;
print $text. "....", length $text, $/;

# See what the system thinks
my $output = `cat $fout`;
print $output, "....", length $output;
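
If you want to check the three readings without going through files at all, something along these lines should do it. This is only a sketch: it assumes the core Encode module is available (it ships with 5.8), and it assumes "UCS-2" above means big-endian, so I have used the encoding name "UCS-2BE". The variable names are mine.

```perl
#!/usr/bin/perl -w
use strict;
use Encode qw(decode);

# The same two bytes the script above writes to the files
my $bytes = "\xC3\x84";

# Read raw: each byte is one character
my $raw = $bytes;

# Read as UTF-8: the two bytes decode to one character
my $utf8 = decode("UTF-8", $bytes);

# Read as UCS-2 (big-endian assumed): the two bytes form one 16-bit character
my $ucs2 = decode("UCS-2BE", $bytes);

# Print only lengths and code points, so no output layer is needed
printf "raw:   %d characters\n", length $raw;
printf "utf-8: %d character, U+%04X\n", length $utf8, ord $utf8;
printf "ucs-2: %d character, U+%04X\n", length $ucs2, ord $ucs2;
```

The bytes never change here either; only the decoding applied to them does. U+00C4 is A with diaeresis and U+C384 is HANGUL SYLLABLE SSE.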