On Mon, Mar 26, 2007 at 05:28:43PM -0400, SrinTuar wrote:
> I frequently run into problems with utf-8 in perl, and I was wondering
> if anyone else had encountered similar things.

Of course :-) I guess everyone who has ever tried perl's utf8 support has
faced similar problems.

> One thing I've noticed is that when processing characters, I often get
> "wide character in print" warnings, or have input/output get horribly
> mangled.

Yes, that's what happens when some function expects bytes but receives a
UTF-8 string containing characters not representable in latin-1, or
something like that. Printing is the usual place where this message shows
up, but recently I hit the same thing with the crypt() call when I wanted
to crypt utf-8 passwords.

> I've been trying to work around it in various ways, commonly doing
> things such as:
> binmode STDIN,":utf8";
> binmode STDOUT,":utf8";

This is okay; it switches these file handles to UTF-8 mode. I usually begin
my perl scripts with

#!/usr/bin/perl -CSDA

which turns on utf8 mode for all file handles (except for sockets); see
"man perlrun" for details.

It's also advisable to have a "use utf8;" pragma at the beginning of the
file, in case the source itself contains utf-8 characters.
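
Putting the two together, a minimal sketch of how such a script might start
(the literal string is just illustrative):

#!/usr/bin/perl
use strict;
use warnings;
use utf8;                        # the source below contains UTF-8 literals

# Same effect as -CS, spelled out explicitly:
binmode STDIN,  ":utf8";
binmode STDOUT, ":utf8";
binmode STDERR, ":utf8";

my $greeting = "helló világ";    # stored as characters thanks to "use utf8"
print "$greeting\n";             # no "Wide character in print" warning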

> or using functions such as :
> sub unfunge_string
> {
>    foreach my $ref (@_)
>    {
>        $$ref = Encode::decode("utf8",$$ref,Encode::FB_CROAK);
>    }
> }

I can't understand this. I guess your goal here is to convert an internally
UTF-8 encoded string into a sequence of bytes that can be written to any
file handle. In that case you need the encode function; decode is the
reverse direction. E.g.

use utf8;
use Encode;
my $password_string = "pásswőrd";
#my $encoded_wrong = crypt($password_string, "xx"); # wrong!
my $password_bytes = Encode::encode("utf8", $password_string);
my $encoded_good = crypt($password_bytes, "xx"); # gives "xx2TrBZ2zni6o"

I also used this trick recently when writing to a socket. If you turn on
the utf8 layer on the socket, you can simply send the utf8 string over it,
but then the return value of the send() call is the number of characters
written, and I don't know what happens if a character is only partially
sent. If you encode to bytes and send those instead of characters, you have
total control over this. The choice is yours.
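
For example, a little sketch of the bytes variant (send_all is a made-up
helper, and $sock is assumed to be an already connected socket):

use Encode;

sub send_all {
    my ($sock, $string) = @_;
    my $bytes = Encode::encode("utf8", $string);    # characters -> octets
    while (length $bytes) {
        my $sent = send($sock, $bytes, 0);
        die "send: $!" unless defined $sent;
        substr($bytes, 0, $sent) = '';  # on a byte string $sent counts octets
    }
}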

> For a language that really goes out of its way to support encodings,
> I wonder if it wouldn't have been better off if it just ignored the
> entire concept altogether and treated strings as arrays of bytes...

That would contradict the whole concept of Unicode. A human-readable string
should never be considered an array of bytes; it is an array of characters!

The problem is that in the good old days perl only knew about strings as
arrays of bytes, and later they had to implement Unicode support without
breaking backwards compatibility. Hence perl strings are currently used to
store both types of data: byte sequences and character sequences.

For each string variable there's a bit telling whether it's known to be a
UTF-8 encoded human-readable string; see "man Encode" and the is_utf8,
_utf8_on and _utf8_off functions. You can think of this bit as telling
whether the string is to be treated as a sequence of bytes or as a sequence
of characters. Having this bit set when the string is not valid utf8 can
yield unexpected behavior - never do that. However, an otherwise valid
utf-8 string which doesn't have this bit set is perfectly legal, and in
some circumstances (e.g. when printing to a file) it behaves differently -
e.g. if the file handle has its utf8 layer set but the string doesn't have
the bit, then each byte is treated as latin1 and re-encoded to utf8, so
you'll get a different (double-encoded) result, not what you expect.
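
A small demonstration of that last point (the byte values are just
illustrative):

use Encode;

binmode STDOUT, ":utf8";

my $bytes = "\xC3\xA1";                        # the UTF-8 octets of "á", bit off
my $chars = Encode::decode("utf8", $bytes);    # one character, bit on

print Encode::is_utf8($bytes) ? 1 : 0, "\n";   # 0
print Encode::is_utf8($chars) ? 1 : 0, "\n";   # 1

print "$chars\n";   # prints "á" as expected
print "$bytes\n";   # each octet is taken as a latin1 character and re-encoded,
                    # so you get the double-encoded "Ã¡" instead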

In some cases you might want to use _utf8_on; this happens when you know
that the string is utf8 but perl doesn't. An example is a gettext lookup
when you've used bind_textdomain_codeset("...", "UTF-8"): the gettext()
call then always returns a valid utf-8 string, but perl doesn't know that,
so the bit is not set.
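
A sketch of that case (the "myapp" domain and the paths are made up, and
I'm assuming a binding like Locale::gettext that exposes
bind_textdomain_codeset):

use Encode;
use POSIX qw(setlocale LC_MESSAGES);
use Locale::gettext;

setlocale(LC_MESSAGES, "");
textdomain("myapp");
bindtextdomain("myapp", "/usr/share/locale");
bind_textdomain_codeset("myapp", "UTF-8");

my $msg = gettext("Hello, world");   # valid utf-8 octets, but the bit is off
Encode::_utf8_on($msg);              # tell perl what we already know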

Perl automatically sets this utf8 bit on strings read from file handles
that have utf8 mode turned on, on strings within the perl source if "use
utf8" is in effect, on the output of decode("charset", "text"), and in the
obvious string operations: e.g. if you concat two strings with this bit
set, do regexp matching on an utf8 string, join/split it and so on.
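
A tiny illustration of that propagation (byte values again illustrative):

use Encode;

my $plain = "plain";                              # no utf8 bit
my $wide  = Encode::decode("utf8", "\xC5\x91");   # "ő", utf8 bit set

my $joined = $plain . $wide;             # concatenation propagates the bit
print Encode::is_utf8($joined) ? "on" : "off", "\n";   # on

my ($copy) = $joined =~ /(.+)/;          # so does a regexp capture
print Encode::is_utf8($copy) ? "on" : "off", "\n";     # on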

> And I'm wondering if in its attempt to be a good i18n citizen, perl
> hasn't gone overboard and made a mess of things instead.

Probably. Just look around and see how many pieces of software, file
formats, protocols and so on suffer from the same problem (no, not utf8 in
particular, I mean being an overcomplicated mess) due to compatibility
issues. Plenty! Perl is just one of them, far from being the worst :-)



-- 
Egmont
