On Thursday 01 Apr 2010 12:25:51 Thomas Bätzler wrote:
> Doug Cacialli <doug.cacia...@gmail.com> wrote:
> > The unicode / UTF16 issues presented by Thomas and Dr. Rudd are a
> > little beyond me.  I'm reading up now, but can someone shed some
> > light?
> 
> Let's see if I can quickly summarize:
> - Text is made up of letters (characters).
> - A computer doesn't "know" letters; it only knows numbers (bytes).
> - An encoding specifies how letters are mapped onto numbers. In classic
>   ASCII encoding, the byte value 65 represents the character "A".
> - For a long time, the most common encodings were single-byte encodings
>   that could represent up to 256 different characters.
> - There are also multi-byte encodings that can encode many more
>   characters. One of those is Unicode, which endeavours to be "the one
>   ring" of character encodings.
> - There are several storage formats for Unicode. One of those is UTF-16,
>   which encodes most of what we'd consider "common" characters in two
>   bytes each, while using four bytes for the rest.
> - One of the big problems with plain text files is that there's usually
>   no hint as to their encoding.
> 
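
The mapping from characters to bytes described above can be seen directly
with the core Encode module. This is only a minimal sketch: bytes_for() is
a hypothetical helper made up for illustration, and the sample characters
are my own choice.

```perl
use strict;
use warnings;
use Encode qw(encode);

# bytes_for() is a hypothetical helper (not part of any module): it encodes
# $text with the named encoding and returns the bytes as upper-case hex.
sub bytes_for {
    my ($encoding, $text) = @_;
    my $bytes = encode($encoding, $text);
    return uc(join(' ', unpack('(H2)*', $bytes)));
}

# "A" is code point 65 (hex 41); the bytes depend on the encoding chosen.
my $code_point = ord('A');
my $ascii      = bytes_for('ascii',    'A');          # one byte
my $utf16le    = bytes_for('UTF-16LE', 'A');          # two bytes, low byte first
my $utf16be    = bytes_for('UTF-16BE', 'A');          # two bytes, high byte first

# A "common" non-ASCII character (EURO SIGN) still fits in two bytes,
# while a character outside the basic plane (MUSICAL SYMBOL G CLEF) needs four.
my $euro = bytes_for('UTF-16BE', "\x{20AC}");
my $clef = bytes_for('UTF-16BE', "\x{1D11E}");

print "code point of A: $code_point\n";
print "ASCII:     $ascii\n";
print "UTF-16LE:  $utf16le\n";
print "UTF-16BE:  $utf16be\n";
print "U+20AC:    $euro\n";
print "U+1D11E:   $clef\n";
```

Note the 00 byte that accompanies "A" in both UTF-16 variants; that is the
byte which shows up as a NUL when such data is read as single-byte text.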

Nice summary. Thanks for taking the time to write it. For further information 
see:

* http://www.joelonsoftware.com/articles/Unicode.html ("The Absolute Minimum 
Every Software Developer Absolutely, Positively Must Know About Unicode and 
Character Sets (No Excuses!)")

* http://perldoc.perl.org/perlunitut.html - which refers to the previous
document in its first paragraph.

Regards and Happy April Fools' Day,

        Shlomi Fish

> So what we assume is happening is that your data is stored as UTF-16, where
> "standard ASCII" characters are stored as two bytes of data.
> 
> Your program doesn't specify that the data is UTF-16, so perl reads it as
> single-byte encoded data, effectively introducing lots of <null>
> characters into the string.
> 
> You now have two options: either convert your input, or modify your program
> to handle Unicode input.
> 
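
For the second option, a minimal sketch of what "handling Unicode input"
might look like: open the file with an explicit :encoding layer so that
perl decodes it while reading. The sample text and the temporary file are
assumptions made only so that the example runs on its own; in a real
program you would open your actual data file instead.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Build a small UTF-16LE sample file (BOM + text) so the sketch is self-contained.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "\xFF\xFE", encode('UTF-16LE', "name,score\n");
close $tmp_fh or die "Cannot close $path: $!";

# Reading without a decoding layer: every ASCII letter drags a NUL byte along.
open my $raw, '<:raw', $path or die "Cannot open $path: $!";
my $bytes = do { local $/; <$raw> };    # slurp the whole file as raw bytes
close $raw;
my $nul_count = () = $bytes =~ /\x00/g;

# Reading with an explicit layer: perl decodes UTF-16 and consumes the BOM.
open my $in, '<:raw:encoding(UTF-16)', $path or die "Cannot open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

printf "raw:     %d bytes, %d NULs\n", length($bytes), $nul_count;
printf "decoded: '%s' (%d characters)\n", $line, length($line);
```

The first option (converting the input) would amount to doing this decoding
once, outside the program, and saving the result in the encoding the
program already expects.
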
> In any case, you could try this snippet in your code:
>   print "NUL character found in string - input might be UTF-16\n"
>       if $line =~ m/\x00/;
> 
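
The NUL check can be taken a step further. The following is a rough
heuristic sketch, not a reliable detector: guess_utf16() is a hypothetical
helper of my own, and it assumes you pass it the first raw bytes of
mostly-ASCII text. It looks for a byte order mark first and otherwise
compares where the NUL bytes fall.

```perl
use strict;
use warnings;

# guess_utf16() is a hypothetical helper (not from any module). It takes raw
# bytes and returns 'UTF-16LE', 'UTF-16BE', or undef if nothing hints at UTF-16.
sub guess_utf16 {
    my ($bytes) = @_;

    # A byte order mark at the very start is the strongest hint.
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;

    # No BOM: count NUL bytes at even and odd offsets. For mostly-ASCII text,
    # little-endian data has its NULs at odd offsets, big-endian at even ones.
    my @nuls = (0, 0);
    for my $i (0 .. length($bytes) - 1) {
        $nuls[$i % 2]++ if substr($bytes, $i, 1) eq "\x00";
    }

    return if $nuls[0] == $nuls[1];    # no NULs, or no clear pattern
    return $nuls[1] > $nuls[0] ? 'UTF-16LE' : 'UTF-16BE';
}

for my $sample ("\xFF\xFE" . "A\x00", "\x00" . "A" . "\x00" . "B", "plain ASCII") {
    my $guess = guess_utf16($sample);
    print defined $guess ? "looks like $guess\n" : "no sign of UTF-16\n";
}
```

Binary files and other encodings can also contain NUL bytes, so treat the
result as a hint about where to look rather than as proof.
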
> 
> MfG,
> Thomas Bätzler

-- 
-----------------------------------------------------------------
Shlomi Fish       http://www.shlomifish.org/
What Makes Software Apps High Quality -  http://shlom.in/sw-quality

Deletionists delete Wikipedia articles that they consider lame.
Chuck Norris deletes deletionists whom he considers lame.

Please reply to list if it's a mailing list post - http://shlom.in/reply .


