Hi! Superb Torsten!
I played around with UTF-8 last week and it was way less hassle than I thought ;-)

My environment looks like this:

    [EMAIL PROTECTED]
    LD_LIBRARY_PATH=/u01/app/oracle/product/10.1.0/client_1/lib
    NLS_LANG=AMERICAN_AMERICA.AL32UTF8
    ORACLE_BASE=/u01/app/oracle
    ORACLE_HOME=/u01/app/oracle/product/10.1.0/client_1
    ORA_NLS33=/u01/app/oracle/product/10.1.0/client_1/nls/data

I hadn't used 3. and 4. until now; I have just changed my setup to use 3. as well.

@5: I use Oracle 10g, DBD-Oracle 1.16 and DBI 1.46, and UTF-8 support works out of the box. My Oracle is configured to use AMERICAN_AMERICA.AL32UTF8. My Oracle even uses 4 bytes for 1 character (in the 'worst' case).

@7: With my setup I didn't need to convert anything to make UTF-8 work. Oracle passes UTF-8, Embperl doesn't touch it (with epchar.c.min) and the browser selects the right encoding. My default header for every page looks like this:

    [$ sub page_xhtml $]
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <meta http-equiv="Content-type" content="text/html; charset=UTF-8" />
    [$ endsub $]

The <meta> tag is needed by IE to choose the UTF-8 encoding (maybe not if 'AddDefaultCharset UTF-8' is configured in apache2).

@8: I also use utf8::decode() for %fdat (but only for vars that need to be converted).

Thanks for the mini-howto, Torsten. @Gerald: maybe you can add it to the Embperl documentation as well.

Alex

-----Original Message-----
From: Torsten Luettgert [mailto:[EMAIL PROTECTED]
Sent: Thursday, 25 November 2004 17:49
To: [EMAIL PROTECTED]
Subject: Little Embperl UTF-8 HOWTO

Hi,

in contrast to my usual reasons for posting, this mail is not for whining about some Embperl bug, but because I wanted to wrap up what one has to do in order to use Embperl in database-driven web applications with UTF-8. Perhaps it'll help someone.
It certainly took me a while to figure all this out.

1. Versions

Use the latest Embperl release. At the moment this is Embperl-2.0rc2. You also need a recent perl. It should work from 5.8.1 on (because that one has "use utf8", see below), but I only know for sure that 5.8.3 works.

2. Embperl compilation

Before doing the usual "perl Makefile.PL; make; make install UNINST=1", copy the file epchar.c.min over epchar.c. If you don't, special characters are quoted in some cases and you see garbage instead of the special character.

3. Configure your apache (I just assume you're using apache 2.0.x) to use UTF-8 as the default character set with the directive

    AddDefaultCharset UTF-8

4. Source code in UTF-8

Write all of your perl files in UTF-8. If you're using vim, you can convert your source files by opening them, doing ":se fileencoding=utf8" and saving them again. You should also tell Perl that the source is UTF-8 with the line

    use utf8;

Regrettably, that pragma only marks the current lexical scope as being UTF-8. I had to insert

    [- use utf8; -]

at the beginning of each and every Embperl source file I have. A bit inelegant, but I didn't find a better solution (why it is important for Perl to know that input is UTF-8 is explained in 6. "The utf-8 flag").

5. Your database

One could probably write big books about databases and UTF-8. I used PostgreSQL, mysql and DB2. mysql and PostgreSQL allow you to set an encoding at database creation time:

    createdb --encoding=utf8 dbname               (pg)
    CREATE DATABASE dbname CHARACTER SET utf8;    (mysql)

DB2 is blissfully utf-8-unaware.

Be warned: except for PostgreSQL, if you have a

    CREATE TABLE test ( ministr varchar(2) );

you can store 2 BYTES in ministr, NOT 2 characters! Meaning, you could store 'ab' or 'ä', but not 'äh'. And the euro symbol '€' is too big for ministr! PostgreSQL allows 2 characters, so even '€€' would fit.

6. The utf-8 flag

Perl now distinguishes between characters and bytes (which makes sense considering that "€" is 3 bytes, but 1 char).
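To make the character/byte difference concrete, here is a minimal sketch. The helper name char_and_byte_length is made up for illustration (it is not part of Perl or Embperl), and the euro sign is written as the escape \x{20AC} so the result does not depend on how the file itself is saved:

```perl
use strict;
use warnings;

# Hypothetical helper: return (character count, UTF-8 byte count) of a string.
sub char_and_byte_length {
    my ($string) = @_;
    my $encoded = $string;     # work on a copy, utf8::encode modifies in place
    utf8::encode($encoded);    # character string -> UTF-8 byte string
    return (length($string), length($encoded));
}

my ($chars, $bytes) = char_and_byte_length("\x{20AC}");   # U+20AC is the euro sign
print "$chars character(s), $bytes byte(s)\n";            # prints "1 character(s), 3 byte(s)"
```

The same helper returns (2, 2) for 'ab', which is why 'ab' fits a byte-counted varchar(2) and '€' does not.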
It also attaches a flag to every string, the utf-8 flag, which tells if the string is in Perl's "internal format" (which is UTF-8). You REALLY want this flag set on all of your strings that are not pure ASCII, because strings are not 'equal' if the flag value differs, even if the string without the utf-8 flag contains the same bytes. Literal strings have this flag automatically set if you did "use utf8;". If you're not sure if a string has the flag, use the DBI function "data_string_desc". Here's an example:

    #!/usr/bin/perl
    # literal before "use utf8": stored as 6 raw bytes, flag off
    $a = "äöü";
    use utf8;
    # literal after "use utf8": stored as 3 characters, flag on
    $b = "äöü";
    if( $a eq $b ) {
        print "Strings are equal.\n";
    }else{
        print "Strings are NOT equal.\n";
    }
    use DBI;
    print "a: ".DBI::data_string_desc($a)."\n";
    print "b: ".DBI::data_string_desc($b)."\n";

This prints the following:

    Strings are NOT equal.
    a: UTF8 off, non-ASCII, 6 characters 6 bytes
    b: UTF8 on, non-ASCII, 3 characters 6 bytes

Which is pretty much self-explanatory. Strings are converted into the internal format with

    utf8::decode($string);

Yes, that's decode, not encode, because from Perl's point of view, the character ENCODING (even if it is the same as Perl uses internally) is undone and it is converted to the internal format. Oh, and a little caveat: only decode strings ONCE. Everything else is asking for trouble. To be sure about that, use

    utf8::decode($string) unless utf8::is_utf8($string);

If you need to convert strings from other encodings, use the Encode module. Example for latin-1:

    use Encode;
    $internal_string = Encode::decode('iso-8859-1', $string);

7. DBI and the utf-8 flag

If you're connecting to a database, you probably use the DBI module (and if you don't, you probably should :-) Strings coming from DBI do NOT, for the time being, have the utf-8 flag set (at least in DBI version 1.46 which I'm using). It is on their TODO list, though. Since you really want that flag on, as explained in 6., you need to convert every string from the database to Perl's internal format with the flag on (if it's not pure ASCII).
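As a rough sketch of what such a conversion could look like: the helper name decode_row is hypothetical, and the hand-built array only stands in for a row as a DBI fetch call might return it (raw UTF-8 bytes, flag off), so no database is needed to try it:

```perl
use strict;
use warnings;

# Hypothetical helper: decode every defined, not yet flagged value of a row in place.
sub decode_row {
    my ($row) = @_;
    for my $value (@$row) {      # $value is an alias, so the row itself is changed
        next unless defined $value;                          # leave NULL columns alone
        utf8::decode($value) unless utf8::is_utf8($value);   # decode only once
    }
    return $row;
}

# Stand-in for a fetched row: "caf" plus the two UTF-8 bytes of e-acute, a NULL, plain ASCII.
my $row = [ "caf\xC3\xA9", undef, "plain" ];
decode_row($row);
print length($row->[0]), " characters\n";   # prints "4 characters"
```

Calling decode_row a second time on the same row leaves it unchanged, because the already decoded value carries the flag and is skipped.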
I was lucky in this, because I use the same wrapper functions for SQL access everywhere, and only had to add utf8::decode() to them (as explained in 6.)

8. Embperl and the utf-8 flag

Strings stored to %udat keep their utf-8 flag, so you don't need to worry about that. %fdat is different, however (Gerald has it on the post-2.0 TODO list). What I said in 7. applies here as well. %fdat can be converted by doing

    foreach my $k (keys %fdat) {
        utf8::decode($fdat{$k});
    }

That's about all I found out on my journey to Embperl with UTF-8. For more documentation, see

    perldoc utf8
    perldoc Encode
    perldoc perlunicode

Greetings,
Torsten