AW: Perl appears to be introducing whitespace when reading .txt files

Thomas Bätzler Thu, 01 Apr 2010 02:26:37 -0700

Doug Cacialli <doug.cacia...@gmail.com> wrote:
> The unicode / UTF16 issues presented by Thomas and Dr. Rudd are a
> little beyond me.  I'm reading up now, but can someone shed some
> light?


Let's see if I can quickly summarize:
- text is made up of letters (characters).
- a computer doesn't "know" letters, it only knows numbers (bytes).
- an encoding specifies how letters are mapped onto numbers. In classic ASCII 
encoding, the byte value 65 represents the character "A".
- For a long time, the most common encodings were single-byte encodings that 
could represent at most 256 different characters.
- There are also multi-byte encodings that can encode many more characters. 
Unicode, which endeavours to be "the one ring" of character sets, relies on 
such encodings.
- There are several storage formats for Unicode. One of those is UTF-16, which 
encodes most of what we'd consider "common" characters in two bytes each, while 
using 4 bytes for the rest.
- One of the big problems with plain text files is that there's usually no hint 
as to their encoding.
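
To make the points above concrete, here is a minimal sketch using the core 
Encode module. It prints the number behind "A" and the bytes the same character 
occupies in ASCII and in UTF-16. The little-endian byte order (UTF-16LE) and 
the sample character U+1F600 are assumptions chosen for illustration; they are 
not taken from the original data.

```perl
use strict;
use warnings;
use Encode qw(encode);

# In ASCII, the character "A" is stored as the number 65.
my $code = ord("A");
print "ASCII value of 'A': $code\n";

# The same character in two encodings, shown as hex bytes.
my $ascii_hex = unpack("H*", encode("ascii",    "A"));   # one byte
my $utf16_hex = unpack("H*", encode("UTF-16LE", "A"));   # two bytes
print "ASCII bytes:    $ascii_hex\n";
print "UTF-16LE bytes: $utf16_hex\n";

# A character outside the "common" range takes four bytes in UTF-16.
# U+1F600 is only an illustrative choice.
my $wide_hex = unpack("H*", encode("UTF-16LE", "\x{1F600}"));
print "U+1F600 bytes:  $wide_hex\n";
```

The second byte of the UTF-16 form of "A" is zero, which is the detail that 
matters for the rest of this mail.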

So what we assume is happening is that your data is stored as UTF-16, where 
even "standard ASCII" characters are stored as two bytes of data: the ASCII 
value plus a zero byte.

Your program doesn't specify that the data is UTF-16, so perl reads it as 
single-byte encoded data and treats every zero byte as a character of its own. 
That is what puts lots of <null> characters into the string.
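
The effect can be reproduced without any file at all. The following sketch 
builds the bytes of a hypothetical UTF-16LE file containing "Hi!" and then 
looks at them the way a program does when it is not told about the encoding. 
The sample text and the little-endian byte order are assumptions.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Simulate the raw bytes of a UTF-16LE text file containing "Hi!".
my $raw = encode("UTF-16LE", "Hi!");

# Read as single-byte data, every byte counts as one character,
# so the string is twice as long as expected.
my $nul_count = () = $raw =~ /\x00/g;
printf "length: %d, NUL characters: %d\n", length($raw), $nul_count;

# Make the NULs visible so the pattern is easy to see.
(my $visible = $raw) =~ s/\x00/<NUL>/g;
print "$visible\n";
```

When such a string is printed to a terminal, the NULs often show up as blanks, 
which would explain the apparent extra whitespace.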

You now have two options: either convert your input, or modify your program to 
handle Unicode input.
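
Both options can be sketched in a few lines. The example below first writes a 
small UTF-16 sample file so that it is self-contained, then reads it back with 
an encoding layer (option two) and finally writes a UTF-8 copy (option one). 
The file contents and the use of temporary files are assumptions, and the plain 
"UTF-16" layer only works if the file starts with a byte order mark. Without 
one you would have to name UTF-16LE or UTF-16BE explicitly.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small UTF-16 sample file (hypothetical content).
my ($tmp_fh, $utf16_file) = tempfile(UNLINK => 1);
close $tmp_fh;
open(my $out, '>:encoding(UTF-16)', $utf16_file)
    or die "Cannot write $utf16_file: $!";
print {$out} "first line\nsecond line\n";
close $out;

# Handle Unicode input: declare the encoding when opening the file.
open(my $in, '<:encoding(UTF-16)', $utf16_file)
    or die "Cannot read $utf16_file: $!";
my @lines = <$in>;
close $in;
chomp @lines;

# Convert the input: write the decoded text out again as UTF-8.
my ($conv_fh, $utf8_file) = tempfile(UNLINK => 1);
binmode($conv_fh, ':encoding(UTF-8)');
print {$conv_fh} "$_\n" for @lines;
close $conv_fh;

print "read ", scalar(@lines), " lines; first is '$lines[0]'\n";
```

Once the encoding is declared, the lines contain no NUL characters and can be 
processed like any other text.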

In any case you could try this snippet in your code:
print "NUL character found in string - input might be UTF-16\n" if $line =~ m/\x00/;


Best regards,
Thomas Bätzler
-- 
BRINGE Informationstechnik GmbH
Zur Seeplatte 12
D-76228 Karlsruhe
Germany

Fon: +49 721 94246-0
Fon: +49 171 5438457
Fax: +49 721 94246-66
Web: http://www.bringe.de/

Managing Director: Dipl.-Ing. (FH) Martin Bringe
VAT ID: DE812936645, HRB 108943 Mannheim


