Re: UTF8, UTF-8, utf8, Utf8 encoding blues

Brandon McCaig Mon, 10 Nov 2014 15:22:51 -0800

Chris:

On Sat, Nov 8, 2014 at 12:05 PM, Chris Knipe <sav...@savage.za.org> wrote:
> I agree with you - and it also explains what we are seeing in
> terms of that certain data comes through clean, and others
> doesn't.  I too expect that the *entire stream* is not encoded
> with UTF8 (even though it should be).
>
> In terms of removing the last three characters, that is not
> what is causing the issue.  Even if I remove the substr and
> pass literally $Body = $Response, the data is still corrupted
> once it goes out via STDOUT.
>
> What is also VERY strange to me is that for some reason when I
> just do something simple like put a use Encoding into the
> script, everything works fine.  Then half an hour later, or a
> day or two later, and it stops working and starts becoming
> corrupt again.  And THIS is what has my mind completely
> baffled.  I am the only one with access to the servers (and
> code), and I am not even logged in on the machines when this
> sudden "change" in behaviour happens.


This thread has been quiet for 2 days. Does that mean that you
have solved your problem off-list, given up, or is it still in
the works?

It doesn't really look like anybody has gone over the basics of
UTF-8 support in Perl in this thread. It is unclear whether you
have already done that yourself, but it seems clear from the OP
that you aren't comfortable with it and aren't confident in your
understanding of it. I'm not an expert at Unicode programming, or
Perl, or Unicode programming in Perl, but I have done a little
bit of basic UTF-8 handling and for the most part understand the
basics of the API.

A few basics:

# The utf8 pragma is only useful if you want to write Perl source
# code directly in UTF-8. It doesn't affect how you read from or
# write to streams. It affects how the source code itself,
# including string literals, is understood.

use utf8;

# The Encode module contains the core API for dealing with
# various encodings.

use Encode;

# This assumes that $foo is a UTF-8 encoded byte string. To Perl
# $foo is just an 8-bit string. $bar becomes a proper Perl
# character-based string with all of the Unicode characters being
# understood by Perl.

my $bar = Encode::decode('utf8', $foo);

# This is obviously the reverse of the above. $bar is considered
# a Perl string where Perl properly understands each character.
# $foo becomes a binary byte-string containing UTF-8 encoded
# characters as sequences of bytes.

my $foo = Encode::encode('utf8', $bar);

__END__

Perl has this concept of its strings being flagged as "UTF-8". In
short, strings are normally not flagged as containing UTF-8 data,
and are assumed to be an 8-bit encoding (e.g., US-ASCII or
Latin1). When a string is decoded in Perl (i.e., Encode::decode)
Perl will decide if the string it is decoding needs to be
internally represented using a multi-byte encoding and
automatically does so. It is transparent to the programmer. In
such a case, that string would be internally flagged as being in
UTF-8, and future character-based operations on that string would
automatically take this into account for you. Vice versa, when a
character-based string is "encoded" you are turning these magic
strings back into a fixed byte-based string of raw data (these
bytes may or may not be characters).
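
A minimal sketch of that round trip, assuming the incoming data
really is UTF-8. The variable names and the sample bytes are made
up for illustration; utf8::is_utf8 only peeks at the internal
flag and is not something you would normally test in real code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw bytes as they might arrive from a socket or a file.
# "e with acute" is the two bytes 0xC3 0xA9 in UTF-8.
my $bytes = "caf\xC3\xA9";

# Decoding turns the byte string into a character string.
my $chars = decode('UTF-8', $bytes);

# length() counts bytes for the first string and characters for
# the second, so the two numbers differ.
printf "bytes: length %d, flagged %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'yes' : 'no';
printf "chars: length %d, flagged %s\n",
    length($chars), utf8::is_utf8($chars) ? 'yes' : 'no';

# Encoding goes the other way and yields the original bytes again.
my $round_trip = encode('UTF-8', $chars);
print "round trip ok\n" if $round_trip eq $bytes;
```

The point is only that the same text has a different length()
before and after decoding, which is why mixing decoded and
undecoded strings tends to produce corrupted output.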

The thing that we programmers need to understand is that Perl
doesn't know what the encoding of data coming from the outside is
so we have to tell it how to interpret (i.e., decode) it. After
that it does everything automatically. Similarly, Perl doesn't
know what encoding the outside world can handle so we need to
tell it how to encode data whenever we share data with the
outside world (i.e., outside of our application).
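
One way to tell Perl about the encoding at the boundary is an
I/O layer on the filehandle instead of calling decode/encode by
hand. The sketch below assumes UTF-8 on both sides and uses
in-memory filehandles as a stand-in for the real input and output
streams; the variable names are hypothetical.

```perl
use strict;
use warnings;

# Stand-in for data from the outside world: the UTF-8 bytes of
# a five-character word containing "i with diaeresis" (0xC3 0xAF).
my $input_bytes = "na\xC3\xAFve\n";

# Reading: the :encoding layer decodes bytes into characters as
# they are read from the handle.
open(my $in, '<:encoding(UTF-8)', \$input_bytes)
    or die "open for reading: $!";
my $line = <$in>;
close($in);
chomp($line);

# Writing: the same layer encodes characters back into bytes on
# the way out.
my $output_bytes = '';
open(my $out, '>:encoding(UTF-8)', \$output_bytes)
    or die "open for writing: $!";
print {$out} uc($line);
close($out);

# For the real STDOUT the equivalent is a one-off binmode call
# made before anything is printed.
binmode(STDOUT, ':encoding(UTF-8)');
print "$line\n";
```

With the layers in place, everything between the read and the
write works on characters, and uc() handles the accented letter
correctly because Perl knows it is a single character.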

Note also that the encoding 'utf8' is "magic" in the sense that
it is not strict in its interpretation of UTF-8. It is Perl's own
lax variant and, as far as I know, it will accept some sequences
that are not valid in standard UTF-8 without complaint. Check the
Encode documentation to be sure. I prefer to enforce UTF-8
strictness.

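If you know that your data is supposed to be *valid* UTF-8 then I
would consider changing those 'utf8' encodings to literally
'UTF-8' (it is the hyphen that matters; the name itself is not
case-sensitive). The 'UTF-8' encoding is the strict mode. Note
that by default Encode replaces malformed input with U+FFFD
rather than dying, so pass Encode::FB_CROAK as the third argument
if you want errors signalled when invalid data is read. Perhaps
that will shed some light on the issue. Or not...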
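
A small sketch of strict decoding that actually fails loudly. It
assumes the default behaviour of Encode, which substitutes U+FFFD
for malformed input rather than raising an error, so
Encode::FB_CROAK is passed explicitly. The sample input is
hypothetical: Latin-1 bytes that are not valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: "cafe" with a Latin-1 e-acute (0xE9), which
# is not a valid UTF-8 sequence on its own.
my $latin1_bytes = "caf\xE9";

# Default CHECK: the bad byte is replaced with U+FFFD and no
# error is raised.
my $lenient = decode('UTF-8', $latin1_bytes);

# FB_CROAK makes decode die on the first malformed sequence.
# LEAVE_SRC asks decode not to modify its input; a copy is
# passed anyway so the original is certainly untouched.
my $copy   = $latin1_bytes;
my $strict = eval {
    decode('UTF-8', $copy,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
my $error = $@;

print "lenient result contains U+FFFD\n"
    if $lenient =~ /\x{FFFD}/;
print "strict decode failed: $error"
    if !defined $strict;
```

If the strict decode dies on real data from the server, that is
a strong hint that at least part of the stream is not UTF-8.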

What you really need is a specification for the data that you're
reading. If you don't know what you're reading then it's
basically impossible to properly read it.

Regards,


-- 
Brandon McCaig <bamcc...@gmail.com> <bamcc...@castopulence.org>
Castopulence Software <https://www.castopulence.org/>
Blog <http://www.bambams.ca/>
perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }.
q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.};
tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say'

