On Thursday 01 Apr 2010 12:25:51 Thomas Bätzler wrote:
> Doug Cacialli <doug.cacia...@gmail.com> wrote:
> > The Unicode / UTF-16 issues presented by Thomas and Dr. Rudd are a
> > little beyond me. I'm reading up now, but can someone shed some
> > light?
>
> Let's see if I can quickly summarize:
> - Text is made up of letters (characters).
> - A computer doesn't "know" letters, it only knows numbers (bytes).
> - An encoding specifies how letters are mapped onto numbers. In classic
>   ASCII encoding, the byte value 65 represents the character "A".
> - For a long time, the most common encodings were single-byte encodings
>   that basically could represent up to 256 different characters.
> - There are also multi-byte encodings that can encode many more
>   characters. One of those is Unicode, which endeavours to be "the one
>   ring" of character encodings.
> - There are several storage formats for Unicode. One of those is UTF-16,
>   which encodes most of what we'd consider "common" characters in two
>   bytes each, while using 4 bytes for the rest.
> - One of the big problems with plain text files is that there's usually
>   no hint as to their encoding.
>
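To make the points above concrete, here is a minimal sketch that prints the bytes a few characters turn into under different encodings. It assumes the core Encode module; the helper name hex_bytes and the sample characters are made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode $text with $encoding and return the
# resulting bytes as space-separated hex pairs.
sub hex_bytes {
    my ($encoding, $text) = @_;
    my $bytes = encode($encoding, $text);
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
}

# "A" in ASCII is the single byte 65 (hex 41).
print hex_bytes('ASCII', 'A'), "\n";

# The same "A" in UTF-16 (little-endian, no BOM) takes two bytes,
# one of which is a NUL byte.
print hex_bytes('UTF-16LE', 'A'), "\n";

# The euro sign (U+20AC) still fits in two bytes.
print hex_bytes('UTF-16LE', "\x{20AC}"), "\n";

# A character outside the "common" range (U+1F600) needs four bytes.
print hex_bytes('UTF-16LE', "\x{1F600}"), "\n";
```

The second line of output is the interesting one for this thread: plain ASCII text stored as UTF-16 has a NUL byte next to every letter.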
Nice summary. Thanks for taking the time to write it. For further information
see:

* http://www.joelonsoftware.com/articles/Unicode.html ("The Absolute Minimum
Every Software Developer Absolutely, Positively Must Know About Unicode and
Character Sets (No Excuses!)")

* http://perldoc.perl.org/perlunitut.html - refers to the previous document
right in its first paragraph.

Regards and Happy April Fools' Day,

	Shlomi Fish

> So what we assume is happening is that your data is stored as UTF-16, where
> "standard ASCII" characters are stored as two bytes of data.
>
> Your program doesn't specify that the data is UTF-16, so perl reads it as
> single-byte encoded data, effectively introducing lots of <null>
> characters into the string.
>
> You now have two options: either convert your input, or modify your program
> to handle Unicode input.
>
> In any case you could try this snippet in your code:
> print "NUL character found in string - input might be UTF-16\n"
>     if $line =~ m/\x00/;
>
> Kind regards,
> Thomas Bätzler

-- 
Shlomi Fish       http://www.shlomifish.org/
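As a rough illustration of the two readings Thomas describes, the sketch below writes a small UTF-16 file and then reads it back twice: once as raw bytes and once through a decoding layer. It assumes the input starts with a byte-order mark (which the plain "UTF-16" encoding name relies on); the sample data and the helper names read_raw_line and read_decoded_line are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Create a sample file: "hello\n" stored as UTF-16 with a BOM.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;
open my $out, '>:raw', $path or die "Cannot write $path: $!";
print {$out} encode('UTF-16', "hello\n");
close $out;

# Read the first line without declaring an encoding: perl sees bytes.
sub read_raw_line {
    my ($file) = @_;
    open my $in, '<:raw', $file or die "Cannot read $file: $!";
    my $line = <$in>;
    close $in;
    return $line;
}

# Read the first line through a decoding layer: perl sees characters.
sub read_decoded_line {
    my ($file) = @_;
    open my $in, '<:encoding(UTF-16)', $file or die "Cannot read $file: $!";
    my $line = <$in>;
    close $in;
    return $line;
}

for my $reader (\&read_raw_line, \&read_decoded_line) {
    my $line = $reader->($path);
    print $line =~ m/\x00/
        ? "NUL character found in string - input might be UTF-16\n"
        : "No NUL characters found\n";
}
```

The raw read trips the NUL check, while the decoded read yields the plain string. If the file has no byte-order mark, the layer would need an explicit name such as UTF-16LE or UTF-16BE instead.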