Hi,

in contrast to my usual reasons for posting, this mail is not for whining about some Embperl bug, but because I wanted to wrap up what one has to do in order to use Embperl in database-driven web applications with UTF-8. Perhaps it'll help someone. It certainly took me a while to figure all this out.

1. Versions

Use the latest Embperl release. ATM this is Embperl-2.0rc2. You also need a recent perl. It should work from 5.8.1 on (because that one has "use utf8", see below), but I only know for sure that 5.8.3 works.

2. Embperl compilation

Before doing the usual

    perl Makefile.PL; make; make install UNINST=1

copy the file epchar.c.min over epchar.c. If you don't, special characters are quoted in some cases and you see garbage instead of the special character.

3. Apache configuration

Configure your apache (I just assume you're using apache 2.0.x) to use UTF-8 as the default character set with the directive

    AddDefaultCharset UTF-8

4. Source code in UTF-8

Write all of your perl files in UTF-8. If you're using vim, you can convert your source files by opening them, doing ":se fileencoding=utf8" and saving them again. You should also tell Perl that the source is UTF-8 with the line

    use utf8;

Regrettably, that pragma only marks the current lexical scope as being UTF-8. I had to insert

    [- use utf8; -]

at the beginning of each and every Embperl source file I have. A bit inelegant, but I didn't find a better solution (why it is important for Perl to know that input is UTF-8 is explained in 6. "The utf-8 flag").

5. Your database

One could probably write big books about databases and UTF-8. I used PostgreSQL, mysql and DB2. mysql and PostgreSQL allow you to set an encoding at database creation time:

    createdb --encoding=utf8 dbname                 (pg)
    CREATE DATABASE dbname CHARACTER SET utf8;      (mysql)

DB2 is blissfully utf-8-unaware.

Be warned: Except for PostgreSQL, if you have a

    CREATE TABLE test ( ministr varchar(2) );

you can store 2 BYTES in ministr, NOT 2 characters! Meaning, you could store 'ab' or 'ä', but not 'äh'. And the euro symbol '€' is too big for ministr! PostgreSQL allows 2 characters, so even '€€' would fit.

6. The utf-8 flag

Perl now distinguishes between characters and bytes (which makes sense considering that "€" is 3 bytes, but 1 char).
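A minimal sketch of that byte/character distinction, not taken from any real application: it writes the euro sign as its three raw UTF-8 bytes and compares length() before and after utf8::decode() (the function is covered further down in this section). The variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The euro sign as raw UTF-8 bytes, the way it would arrive from a
# file or a socket before any decoding has happened.
my $bytes = "\xE2\x82\xAC";

# Work on a copy, because utf8::decode() modifies its argument in place.
my $chars = $bytes;
utf8::decode($chars);

printf "undecoded: length %d\n", length $bytes;   # length() counts bytes here
printf "decoded:   length %d\n", length $chars;   # length() counts characters here
```

On a perl that has the utf8:: functions built in, this should report a length of 3 for the undecoded string and 1 for the decoded one.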
It also attaches a flag to every string, the utf-8 flag, which tells whether the string is in Perl's "internal format" (which is UTF-8). You REALLY want this flag set on all of your strings that are not pure ASCII, because strings are not 'equal' if the flag value differs, even if the string without the utf-8 flag contains the same bytes. Literal strings have this flag set automatically if you did "use utf8;". If you're not sure whether a string has the flag, use the DBI function "data_string_desc". Here's an example:

    #!/usr/bin/perl
    # $a is assigned before "use utf8", so its literal is read as bytes
    $a = "äöü";
    use utf8;
    # from here on, literals are read as UTF-8 characters
    $b = "äöü";
    if( $a eq $b ) {
        print "Strings are equal.\n";
    }else{
        print "Strings are NOT equal.\n";
    }
    use DBI;
    print "a: ".DBI::data_string_desc($a)."\n";
    print "b: ".DBI::data_string_desc($b)."\n";

This prints the following:

    Strings are NOT equal.
    a: UTF8 off, non-ASCII, 6 characters 6 bytes
    b: UTF8 on, non-ASCII, 3 characters 6 bytes

Which is pretty much self-explanatory.

Strings are converted into the internal format with

    utf8::decode($string);

Yes, that's decode, not encode, because from Perl's point of view, the character ENCODING (even if it is the same one Perl uses internally) is undone and the string is converted to the internal format.

Oh, and a little caveat: only decode strings ONCE. Everything else is asking for trouble. To be sure about that, use

    utf8::decode($string) unless utf8::is_utf8($string);

If you need to convert strings from other encodings, use the Encode module. Example for latin-1:

    use Encode;
    $internal_string = Encode::decode('iso-8859-1', $string);

7. DBI and the utf-8 flag

If you're connecting to a database, you probably use the DBI module (and if you don't, you probably should :-)

For the time being, strings coming from DBI do NOT have the utf-8 flag set (at least in DBI version 1.46, which I'm using). It is on their TODO list, though. Since you really want that flag on, as explained in 6., you need to convert every string from the database to Perl's internal format with the flag on (if it's not pure ASCII).
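A rough sketch of what such a conversion could look like, assuming rows arrive as array references (the shape DBI's fetchrow_arrayref returns). The helper name decode_row is hypothetical and not part of DBI, and the row below is a hand-written stand-in, so no database is needed to try it.

```perl
use strict;
use warnings;

# Hypothetical helper (not a DBI function): decode every field of one
# row in place, skipping NULLs and strings that are already flagged.
sub decode_row {
    my ($row) = @_;
    for my $field (@$row) {               # $field aliases the array element
        next unless defined $field;       # leave NULL columns alone
        next if utf8::is_utf8($field);    # decode only once
        utf8::decode($field);
    }
    return $row;
}

# Stand-in for a fetched row: "äh" as raw UTF-8 bytes, a NULL, plain ASCII.
my $row = decode_row([ "\xC3\xA4h", undef, "abc" ]);
printf "first column has %d characters\n", length $row->[0];
```

The is_utf8 check is there so that calling the helper twice on the same row does not decode a string a second time.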
I was lucky in this, because I use the same wrapper functions for SQL access everywhere, and only had to add utf8::decode() to them (as explained in 6.).

8. Embperl and the utf-8 flag

Strings stored to %udat keep their utf-8 flag, so you don't need to worry about that. %fdat is different, however (Gerald has it on the post-2.0 TODO list). What I said in 7. applies here as well. %fdat can be converted by doing

    foreach my $k (keys %fdat) {
        utf8::decode($fdat{$k});
    }

That's about all I found out on my journey to Embperl with UTF-8. For more documentation, see

    perldoc utf8
    perldoc Encode
    perldoc perlunicode

Greetings,
Torsten