On Wed, 12 Nov 2003, Anthony Gardner wrote:

...

> So, with all the layers (Apache, Perl, XML, XSP,
> MySQL) all dealing with latin1, UTF-8 by default, what
> am I to do . Has anyone covered this?

Not all of them, and not in a single project.

To start with perl. The first version of perl with reliable Unicode
support was 5.8.0, and that is the one I have installed. The people on the
perl-unicode list make a strong distinction between unicode character
semantics and the utf-8 encoding. Perl uses the former internally, and is
capable of reading and writing the latter (and obviously quite a few other
encodings).

I think that, on most platforms and default locales, perl expects input
and output streams, say STDIN and STDOUT, to be in iso-8859-1 and makes
the translation into unicode semantics from that assumption. If you read
utf-8 from STDIN (for example from a form with method post), you have to
explicitly tell it to do things the right way, e.g.,

        binmode STDIN,':utf8';

and

        binmode STDOUT,':utf8';
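
A minimal sketch of what the :utf8 layer buys you (the input is whatever
you feed the script, nothing here is specific to my setup): once the
streams are marked as utf-8, string functions such as length() count
characters rather than bytes.

        #!/usr/bin/perl
        use strict;
        use warnings;

        # tell perl that STDIN and STDOUT carry utf-8 encoded text
        binmode STDIN,  ':utf8';
        binmode STDOUT, ':utf8';

        while (my $line = <STDIN>) {
            chomp $line;
            # length() now counts characters, not bytes
            print length($line), " characters: ", $line, "\n";
        }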

You can also use the Encode module

        $y = decode_utf8($x);

But bear in mind that when you have a utf-8 string, you decode it into
perl's internal unicode semantics and then encode it as utf-8 again when
it leaves your program. Perl doesn't work in utf-8 internally, it works
in unicode.
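
A small sketch of that round trip with the standard Encode module (the
byte string is just an example):

        use Encode qw(decode_utf8 encode_utf8);

        # $octets is a byte string as it arrives from the outside world;
        # these are the utf-8 bytes for "Malmö"
        my $octets = "Malm\xc3\xb6";

        # decode: utf-8 octets -> perl's internal character semantics
        my $chars = decode_utf8($octets);
        print length($chars), "\n";   # 5 characters

        # encode: characters -> utf-8 octets again, for instance before
        # handing the string to something that expects raw bytes
        my $bytes = encode_utf8($chars);
        print length($bytes), "\n";   # 6 bytes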

Similar considerations hold for databases and text retrieval engines;
those, however, store data as encoded text. You have to decode it before
you can do anything with it in perl.

When perl is used to interface with such programs one has to take
precautions -- and some of the APIs used predate the current unicode
support in perl.

This is a piece of SQL where I create a table in mysql 4.something

last_name       VARCHAR(100) CHARACTER SET utf8    DEFAULT ''      NOT NULL,
first_name      VARCHAR(100) CHARACTER SET utf8    DEFAULT ''      NOT NULL,

There will be collation tables etc, but COLLATE utf8_general_ci didn't
work for me. You may need to encode your strings as utf-8 before storing
them in mysql.
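
What that looks like from perl, as a sketch assuming DBI with DBD::mysql
(the table name "persons" and the connection details are made up):

        use strict;
        use warnings;
        use DBI;
        use Encode qw(encode_utf8);

        # placeholder connection details
        my $dbh = DBI->connect('DBI:mysql:database=mydb', 'user', 'secret',
                               { RaiseError => 1 });

        # perl character strings, e.g. straight from a decoded form
        my ($last_name, $first_name) = ("Ahlstr\x{f6}m", "Sigfrid");

        # encode to utf-8 octets before they go into the utf8 columns
        my $sth = $dbh->prepare(
            'INSERT INTO persons (last_name, first_name) VALUES (?, ?)');
        $sth->execute(encode_utf8($last_name), encode_utf8($first_name));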

Then there are apache, axkit and all the other software packages that one
has to use in order to produce a service... My experience is that axkit
does the right thing when delivering stuff. The command

        HEAD http://sigge.lub.lu.se/2002/Master/cleaned-descriptions/Mh_54.xml

yields

        ...

        Content-Type: text/html; charset=utf-8

        ...

I.e., the Content-Type is set correctly by axkit. Doing the same for the
corresponding static rendition

        http://sigge.lub.lu.se/2002/Master/html/Mh_54.html

doesn't yield the correct header. I reckon that renaming it to
Mh_54.html.utf8 would solve that problem.
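
The HEAD command comes with libwww-perl, so the same check can be
scripted; a small sketch:

        use strict;
        use warnings;
        use LWP::UserAgent;

        my $ua = LWP::UserAgent->new;
        for my $url (
            'http://sigge.lub.lu.se/2002/Master/cleaned-descriptions/Mh_54.xml',
            'http://sigge.lub.lu.se/2002/Master/html/Mh_54.html',
        ) {
            # a HEAD request is enough to see what charset the server claims
            my $res = $ua->head($url);
            print $url, "\n  ", $res->header('Content-Type') || '(none)', "\n";
        }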

I use the following

<xsl:output
    encoding="utf-8"
    method="xml"
    content-type="text/html"
    doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN"
    doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
/>

and somewhere down the style sheet there is

<xsl:element name="meta">
<xsl:attribute name="http-equiv">Content-Type</xsl:attribute>
<xsl:attribute name="content">text/html; charset=utf-8</xsl:attribute>
</xsl:element>

and that works fine with recent mozilla, firebird, MSIE and lynx.

Don't use content-type="text/html; charset=utf-8" in the xsl:output. That
is wrong according to the spec, and axkit assumes that you do it right...

I check with lynx because I believe that what I can use with lynx can
also be used by people with a braille terminal. This particular page is a
problem in lynx and MSIE since I haven't got a clue as to how I should
force them to load a font that includes greek diacritics. The next
problem is to have proper settings in the CSS, and I suppose I'm not done
there.

I have two projects in the pipeline involving Hebrew, Arabic, Russian,
Greek and possibly more languages in combination with English. Perl knows
about left-to-right and right-to-left scripts, but I'm not sure that the
available search engines do...


Sigge
