Little Embperl UTF-8 HOWTO

Torsten Luettgert Thu, 25 Nov 2004 08:48:52 -0800

Hi,

in contrast to my usual reasons for posting, this mail is not for
whining about some Embperl bug, but because I wanted to wrap up what one
has to do in order to use Embperl in database-driven web applications
with UTF-8. Perhaps it'll help someone. It certainly took me a while to
figure all this out.



1. Versions
Use the latest Embperl release. At the time of writing this is Embperl-2.0rc2.
You also need a recent perl. It should work from 5.8.1 on (because that
one has "use utf8", see below), but I only know for sure that 5.8.3
works.


2. Embperl compilation
Before doing the usual "perl Makefile.PL; make; make install UNINST=1",
copy the file epchar.c.min over epchar.c.
If you don't, special characters are escaped in some cases and you see
garbage instead of the special character.


3. Apache configuration
Configure your Apache (I just assume you're using Apache 2.0.x) to use
UTF-8 as the default character set with the directive

AddDefaultCharset UTF-8


4. Source code in UTF-8
Write all of your perl files in UTF-8. If you're using vim, you can
convert your source files by opening them, doing ":se fileencoding=utf8"
and saving them again.
You should also tell Perl that the source is UTF-8 with the line

use utf8;

Regrettably, that pragma only marks the current lexical scope as being
UTF-8. I had to insert [- use utf8; -] at the beginning of each and
every Embperl source file I have. A bit inelegant, but I didn't find
a better solution (why it is important for Perl to know that input is
UTF-8 is explained in 6. "the utf-8 flag").


5. Your database
One could probably write big books about databases and UTF-8. I used
PostgreSQL, mysql and DB2. mysql and PostgreSQL allow you to set an
encoding at database creation time:

createdb --encoding=utf8 dbname             (pg)
CREATE DATABASE dbname CHARACTER SET utf8;  (mysql)
DB2 is blissfully utf-8-unaware.

Be warned: Except for PostgreSQL, if you have a
CREATE TABLE test (
  ministr varchar(2)
);
you can store 2 BYTES in ministr, NOT 2 characters! Meaning,
you could store 'ab' or 'ä', but not 'äh'. And the euro symbol '€' is
too big for ministr!
PostgreSQL allows 2 characters, so even '€€' would fit.
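
To check how much room a string needs in a byte-counted column, you can
compare its character length with its encoded byte length. This is only a
minimal sketch; utf8_byte_length is a made-up helper name, not something
DBI or Embperl provides:

```perl
use strict;
use warnings;
use utf8;                       # the string literals below are UTF-8 source
use Encode qw(encode);

# Hypothetical helper: number of bytes a character string needs as UTF-8.
sub utf8_byte_length {
    my ($chars) = @_;
    return length(encode('UTF-8', $chars));
}

binmode(STDOUT, ':encoding(UTF-8)');
for my $s ('ab', 'ä', 'äh', '€') {
    printf "%-2s  %d character(s), %d byte(s)\n",
        $s, length($s), utf8_byte_length($s);
}
```

'äh' and '€' both come out at 3 bytes, which is why neither fits into a
column that only holds 2 bytes.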


6. The utf-8 flag
Perl now distinguishes between characters and bytes (which makes sense
considering that "€" is 3 bytes, but 1 char). It also attaches a flag to
every string, the utf-8 flag, which tells if the string is in Perl's
"internal format" (which is UTF-8).
You REALLY want this flag set on all of your strings that are not pure
ASCII, because strings are not 'equal' if the flag value differs, even
if the string without utf-8 flag contains the same bytes.
Literal strings have this flag automatically set if you did "use utf8;".

If you're not sure if a string has the flag, use the DBI function
"data_string_desc". Here's an example:

#!/usr/bin/perl
$a = "äöü";
use utf8;
$b = "äöü";
if( $a eq $b ) {
  print "Strings are equal.\n";
}else{
  print "Strings are NOT equal.\n";
}
use DBI;
print "a: ".DBI::data_string_desc($a)."\n";
print "b: ".DBI::data_string_desc($b)."\n";

This prints the following:

Strings are NOT equal.
a: UTF8 off, non-ASCII, 6 characters 6 bytes
b: UTF8 on, non-ASCII, 3 characters 6 bytes

Which is pretty much self-explanatory: without "use utf8", Perl treats
each of the 6 bytes in $a as one character.
Strings are converted into the internal format with

utf8::decode($string);

Yes, that's decode, not encode, because from Perl's point of
view, the character ENCODING (even if it is the same as Perl uses
internally) is undone and it is converted to the internal format.
Oh, and a little caveat: only decode strings ONCE. Everything else
is asking for trouble. To be sure about that, use

utf8::decode($string) unless utf8::is_utf8($string);

If you need to convert strings from other encodings, use the Encode
module. Example for latin-1:

use Encode;
$internal_string = Encode::decode('iso-8859-1', $string);
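
To illustrate that decoding makes the original encoding irrelevant, here is
a small sketch that decodes the same text from latin-1 bytes and from UTF-8
bytes and compares the results. to_internal is a hypothetical wrapper name
I made up for this example; the only real call in it is Encode::decode:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn raw bytes in a known encoding into a Perl
# character string. The encoding name is handed to Encode::decode as-is.
sub to_internal {
    my ($encoding, $bytes) = @_;
    return Encode::decode($encoding, $bytes);
}

# "äöü" as raw bytes: 3 bytes in latin-1, 6 bytes in UTF-8.
my $from_latin1 = to_internal('iso-8859-1', "\xE4\xF6\xFC");
my $from_utf8   = to_internal('UTF-8', "\xC3\xA4\xC3\xB6\xC3\xBC");

print "Both decode to the same text.\n" if $from_latin1 eq $from_utf8;
printf "Length: %d characters\n", length($from_utf8);
```

Once both strings are in the internal format, they compare equal and each
has 3 characters.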


7. DBI and the utf-8 flag
If you're connecting to a database, you probably use the DBI module
(and if you don't, you probably should :-)
Strings coming from DBI have, at the time being, the utf-8 flag NOT set
(at least in DBI version 1.46 which I'm using). It is on their TODO
list, though.

Since you really want that flag on, as explained in 6., you need to
convert every string from the database to Perl's internal format with
the flag on (if it's not pure ASCII). I was lucky in this, because I use
the same wrapper functions for SQL access everywhere, and only had to
add utf8::decode() to them (as explained in 6.)
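
A wrapper of that kind could look roughly like the following sketch.
decode_fields is a hypothetical name, and the row here is a hand-made
stand-in rather than a real database result; it assumes the database
delivers UTF-8 bytes:

```perl
use strict;
use warnings;

# Hypothetical helper: takes the fields of one row and returns copies in
# Perl's internal format. Assumes the input fields are UTF-8 bytes.
sub decode_fields {
    my @fields = @_;
    for my $field (@fields) {
        next unless defined $field;        # NULL columns arrive as undef
        next if utf8::is_utf8($field);     # already decoded, leave alone
        utf8::decode($field);              # bytes -> characters, in place
    }
    return @fields;
}

# Stand-in for a fetched row: "Zürich" as UTF-8 bytes, a NULL, a number.
my @row = decode_fields("Z\xC3\xBCrich", undef, 42);
printf "first field: %d characters, flag %s\n",
    length($row[0]), utf8::is_utf8($row[0]) ? "on" : "off";
```

With a real statement handle, the call would be something like
"my @row = decode_fields($sth->fetchrow_array);".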


8. Embperl and the utf-8 flag
Strings stored to %udat keep their utf-8 flag, so you don't need to
worry about that.
%fdat is different, however (Gerald has it on the post-2.0 TODO list).
What I said in 7. applies here as well. %fdat can be converted by doing

foreach my $k (keys %fdat) {
  utf8::decode($fdat{$k});
}

That's about all I found out about on my journey to Embperl with UTF-8.
For more documentation, see

perldoc utf8
perldoc Encode
perldoc perlunicode

Greetings,
Torsten

