Doug Cacialli <doug.cacia...@gmail.com> wrote:
> The unicode / UTF16 issues presented by Thomas and Dr. Rudd are a
> little beyond me. I'm reading up now, but can someone shed some
> light?

Let's see if I can quickly summarize:

- Text is made up of letters (characters).
- A computer doesn't "know" letters, it only knows numbers (bytes).
- An encoding specifies how letters are mapped onto numbers. In
  classic ASCII encoding, the byte value 65 represents the
  character "A".
- For a long time, the most common encodings were single-byte
  encodings that could represent at most 256 different characters.
- There are also multi-byte encodings that can encode many more
  characters. The most important of these belong to Unicode, which
  endeavours to be "the one ring" of character sets.
- There are several storage formats for Unicode. One of those is
  UTF-16, which encodes most of what we'd consider "common"
  characters in two bytes each, while using 4 bytes for the rest.
- One of the big problems with plain text files is that there's
  usually no hint as to their encoding.

So what we assume is happening is that your data is stored as UTF-16,
where "standard ASCII" characters are stored as two bytes of data: the
familiar ASCII value plus a zero byte. Your program doesn't specify
that the data is UTF-16, so perl reads it as single-byte encoded data,
effectively introducing lots of <null> characters into the string. An
"A", for example, arrives as the two bytes 65 and 0 instead of as one
character.

You now have two options: either convert your input, or modify your
program to handle Unicode input.

In any case you could try this snippet in your code:

    print "NUL character found in string - input might be UTF-16\n"
        if $line =~ m/\x00/;

Best regards,
Thomas Bätzler
-- 
BRINGE Informationstechnik GmbH
Zur Seeplatte 12
D-76228 Karlsruhe
Germany

Phone: +49 721 94246-0
Phone: +49 171 5438457
Fax: +49 721 94246-66
Web: http://www.bringe.de/

Managing director: Dipl.-Ing. (FH) Martin Bringe
VAT ID: DE812936645, HRB 108943 Mannheim
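
P.S. For the second option (modifying the program to handle Unicode input), here is a minimal sketch of what that could look like. It is not taken from your program: the helper looks_like_utf16 and the temporary sample file are made up for illustration, and I'm assuming the data really is UTF-16 with a byte order mark at the start. It writes a small UTF-16 file, reads it once as raw bytes (which is roughly what your program does now) and once through an :encoding(UTF-16) layer.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Hypothetical helper: guess whether a raw byte string looks like UTF-16.
# It checks for a byte order mark first, then falls back to the NUL test.
# This is only a heuristic; binary data would also trigger it.
sub looks_like_utf16 {
    my ($bytes) = @_;
    return 1 if $bytes =~ /\A(?:\xFF\xFE|\xFE\xFF)/;   # UTF-16 BOM (LE or BE)
    return 1 if $bytes =~ /\x00/;                      # zero bytes next to ASCII
    return 0;
}

# Create a sample file: BOM + "Hello\n" stored as UTF-16 little-endian.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':raw';
print {$out} encode('UTF-16LE', "\x{FEFF}Hello\n");
close $out or die "Cannot close $path: $!";

# Read 1: no encoding given, so we get the bytes exactly as stored.
open my $in, '<:raw', $path or die "Cannot open $path: $!";
my $bytes = do { local $/; <$in> };
close $in;

# Read 2: tell perl the file is UTF-16; the layer decodes it to characters
# and uses the BOM to pick the byte order.
open $in, '<:encoding(UTF-16)', $path or die "Cannot open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

printf "raw bytes: %s\n",
    join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
printf "raw read looks like UTF-16: %s\n",
    looks_like_utf16($bytes) ? 'yes' : 'no';
printf "decoded line: '%s' (%d characters)\n", $line, length $line;
```

The raw read should contain the zero bytes that trip up your regular expressions, while the decoded line should be plain "Hello". If your file has no byte order mark, you would have to name the byte order yourself, e.g. :encoding(UTF-16LE).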