On Mon, Mar 26, 2007 at 05:28:43PM -0400, SrinTuar wrote:
> I frequently run into problems with utf-8 in perl, and I was wondering
> if anyone else had encountered similar things.

Of course :-) I guess everyone who has ever tried perl's utf8 support has
faced similar problems.

> One thing I've noticed is that when processing characters, I often get
> warnings about "wide characters in print", or have input/output get
> horribly mangled.

Yes, that's what happens when some function expects bytes but receives a
utf-8 string that contains characters not representable in latin-1, or
something like that. Printing is the usual case where this message shows
up, but recently I ran into the same thing with the crypt() call when I
wanted to crypt utf-8 passwords.

> I've been trying to work around it in various ways, commonly doing things
> such as:
>   binmode STDIN,  ":utf8";
>   binmode STDOUT, ":utf8";

This is okay, it switches these file descriptors to UTF-8 mode. I usually
begin my perl scripts with

  #!/usr/bin/perl -CSDA

which turns on utf8 mode for all the file descriptors (except for sockets),
see "man perlrun" for details. It's also advisable to have a "use utf8;"
pragma at the beginning of the file, in case the source itself contains
utf-8 characters.
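To make that concrete, here is a minimal sketch of such a script - the
uc() call is just a placeholder for whatever processing you actually do:

  #!/usr/bin/perl
  # Explicit alternative to "#!/usr/bin/perl -CSDA": binmode covers the
  # standard streams (the "S" part); -D and -A additionally cover newly
  # opened files and @ARGV, see "man perlrun".
  use strict;
  use warnings;
  use utf8;                  # needed as soon as the source contains UTF-8 literals

  binmode STDIN,  ":utf8";   # input bytes are decoded into characters
  binmode STDOUT, ":utf8";   # characters are encoded back into bytes on output

  while (my $line = <STDIN>) {
      print uc $line;        # uc() works on characters, so "ő" becomes "Ő"
  }

With the :utf8 layers in place you never hand character strings to a
byte-oriented descriptor, which is exactly the situation that produces the
"wide character in print" warning.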
> or using functions such as:
>   sub unfunge_string
>   {
>       foreach my $ref (@_)
>       {
>           $$ref = Encode::decode("utf8", $$ref, Encode::FB_CROAK);
>       }
>   }

I don't quite understand this one. I guess your goal here is to convert an
internally UTF-8 encoded string into a sequence of bytes that can be passed
to any file descriptor. In that case you need the encode function; decode
is the reverse direction. E.g.

  use utf8;
  use Encode;

  my $password_string = "pásswőrd";
  #my $encoded_wrong  = crypt($password_string, "xx");   # wrong!
  my $password_bytes  = Encode::encode("utf8", $password_string);
  my $encoded_good    = crypt($password_bytes, "xx");    # gives "xx2TrBZ2zni6o"

I also used this trick recently when writing to a socket. If you turn on
the utf8 flag on the socket, you can simply send the utf8 string over it,
but the return value of the send() call is then the number of characters
written, and I don't know what happens if a character is only partially
sent. If you send bytes instead of characters, you have total control over
this. The choice is yours.

> For a language that really goes out of its way to support encodings, I
> wonder if it wouldn't have been better off if it just ignored the entire
> concept altogether and treated strings as arrays of bytes...

That would contradict the whole concept of Unicode. A human-readable string
should never be considered an array of bytes, it is an array of characters!
The problem is that in the good old days perl only knew about strings as
arrays of bytes, and later Unicode support had to be implemented without
breaking backwards compatibility. Hence perl strings are currently used to
store both kinds of data: byte sequences and character sequences. For each
string variable there is a bit telling whether it is known to be a UTF-8
encoded human-readable string; see "man Encode" and the is_utf8, _utf8_on
and _utf8_off functions. You can think of this bit as the piece of
information that says whether the string is to be treated as a sequence of
bytes or a sequence of characters. Having this bit set when the string is
not valid utf8 can yield unexpected behavior - never do that. However, an
otherwise valid utf-8 string that doesn't have this bit set is perfectly
legal, and in some circumstances it behaves differently - e.g. when
printing to a file descriptor that has its utf8 flag set while the string
doesn't, then IIRC the string is converted from latin1 to utf8, and you get
a different result than you expected.

In some cases you might want to use _utf8_on; this is when you know that
the string is utf8 but perl doesn't know it. An example is a gettext lookup
after bind_textdomain_codeset("...", "UTF-8"): the gettext() call then
always returns a valid utf-8 string, but perl doesn't know that, so the bit
is not set.

Perl automatically sets this utf8 bit on strings read from file descriptors
that have utf8 mode turned on, on strings within the perl source if
"use utf8" is in effect, on the output of decode("charset", "text"), and in
the obvious string operations, e.g. if you concatenate two strings that
have the bit set, do regexp matching on a utf8 string, join/split it, and
so on.
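To make the flag business concrete, here is a small sketch (the password
string is of course just an example):

  use strict;
  use warnings;
  use Encode qw(decode is_utf8);

  my $bytes = "p\xc3\xa1ssw\xc5\x91rd";   # raw UTF-8 bytes, the flag is off
  my $chars = decode("utf8", $bytes);     # same text, now flagged as characters

  print is_utf8($bytes) ? "chars" : "bytes", "\n";   # prints "bytes"
  print is_utf8($chars) ? "chars" : "bytes", "\n";   # prints "chars"
  print length($bytes), "\n";             # 10 - perl counts bytes here
  print length($chars), "\n";             # 8  - perl counts characters here

  # Encode::_utf8_on($string) would flip the flag without converting
  # anything - only do that when the string is already known to be valid
  # UTF-8 (the gettext case above), otherwise you get the unexpected
  # behavior I mentioned.

Note that decode() actually converts the data, whereas _utf8_on() merely
flips the flag, which is why the latter is only safe for strings you
already know to be valid UTF-8.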
> And I'm wondering if in its attempt to be a good i18n citizen, perl
> hasn't gone overboard and made a mess of things instead.

Probably. Just look around and see how many pieces of software, file
formats, protocols and so on suffer from the same problem due to
compatibility issues (no, not utf8 in particular, I mean being an
overcomplicated mess). Plenty! Perl is just one of them, and far from
being the worst :-)

--
Egmont

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/