On Tuesday, 26. February 2002 17:37, Matt Sergeant wrote:
(Uhm, imagine this were addressed to Daisuke Maki :-)

> On Sat, 23 Feb 2002, Daisuke Maki wrote:
> > > Ah, that's a problem I've had before :) The absolutely totally latest
> > > AxKit CVS *might* fix that problem. If it doesn't, you might want to
> > > have a look at AxKit::XSP::CharConv:
> > >
> > > <xsp:expr>
> > >   <char:charset-convert from='EUC'>$foo</char:charset-convert>
> > > </xsp:expr>
> > >
> > > That should take care of converting whatever your input is to UTF-8
> > > (you can use it to convert to whatever, so long as your iconv supports
> > > it).
> >
> > I looked at the CVS, and I don't think this issue is addressed. And
> > quite frankly, I don't want to have to add more taglibs for every
> > "outside" variable that I need to use, so I have a proposition:

You are correct. The patch assumes (correctly) that UTF-8 is the Perl string 
encoding. Perl is UTF-8, like it or not. It may not work really well yet, but 
it is the declared behaviour. Moreover, my experience shows it does work once 
you get the hang of it and abolish _all_ and _every_ bit of non-UTF-8 handling.

The trick is to use XML::LibXML's encodeToUTF8($encoding, $string) as early as 
possible to convert any non-UTF-8 data into UTF-8. And by the way: string 
_literals_ in XSP code get converted automatically, since they are parsed 
text.
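
To make that concrete, a minimal sketch (the charset and the $raw value are 
made-up examples; depending on the XML::LibXML version the function may also 
live in XML::LibXML::Common):

  use XML::LibXML;

  # made-up input: bytes that arrived in a legacy charset
  my $raw = "1.50 \xA4";    # "1.50 <euro sign>" in ISO-8859-15

  # convert as early as possible, before the value is used anywhere else
  my $utf8 = XML::LibXML::encodeToUTF8('iso-8859-15', $raw);

  # from here on, only $utf8 gets passed around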

> > Is it ok if we force the result of a <xsp:expr> to be converted to UTF-8,
> > no matter what? Attached is the diff of this proposition.

How would you determine the encoding? The only way would be some encoding 
specifier, which would just replicate the easy-to-use encodeToUTF8 function. 
Moreover, Perl strings already are UTF-8 (from a user's point of view they 
are, see above... don't let yourself be fooled by the fact that they behave a 
bit like byte strings and real UTF-8 support is a bit shaky... this is the 
future, and you had better start using it now).

> I honestly think it's the wrong approach. The patch assumes all external
> data sources will be in the same encoding as the XSP source. This seems
> wrong to me. Taking the two examples I can think of - form params and DBI

This would be plain buggy. Take this as an example (XSP source in 
ISO-8859-15):
<xsp:expr>$price." €"</xsp:expr>
(for anyone not accustomed to the ISO-8859-15 charset, the character in the 
quotes is the euro sign)
Imagine $price = "1.50".
If the result were converted as well, we would end up with "1.50 " followed by 
three mangled characters, not a euro sign. Why that?
Upon parsing the XSP code, the expression has already been converted to 
$price." €" with UTF-8 as the encoding, so the euro sign is now a three-byte 
UTF-8 sequence.
If we evaluate that and convert it again (interpreting the part that is 
already UTF-8 as ISO-8859-15), those three bytes are taken literally as three 
separate characters, not as the UTF-8 euro sign.
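
To see the same effect at the byte level, a small sketch using the Encode 
module from newer Perls (not anything AxKit does, just an illustration of the 
double conversion):

  use Encode qw(encode decode);

  # the euro sign, already stored as UTF-8 bytes (E2 82 AC)
  my $euro_utf8 = encode('UTF-8', "\x{20AC}");

  # the buggy second conversion: treat those UTF-8 bytes as ISO-8859-15
  # characters and convert them to UTF-8 once more
  my $double = encode('UTF-8', decode('iso-8859-15', $euro_utf8));

  printf "%v02X\n", $euro_utf8;  # E2.82.AC           - one euro sign
  printf "%v02X\n", $double;     # C3.A2.C2.82.C2.AC  - three mangled chars
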
The other option, not converting document data to UTF-8 but leaving it as it 
is, gets us into problems as well: internally all of this uses Unicode, so 
there can be Unicode characters coming from all over the place, e.g. through 
entities, through numerical references, or through taglibs that behave the 
proper way.

> calls, both should theoretically (assuming you've setup your database
> correctly, and the modules you're using are working right) return data in

Modules should use UTF-8; it is the way to go. Instead of kludging AxKit, fix 
your database to store native UTF-8, or your DB driver to do the conversion, 
or whatever - having non-UTF-8 data in Perl space is asking for trouble in the 
long term.
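
If the driver cannot do it for you, the conversion can at least happen in one 
place, right after fetching. A sketch, assuming the database hands back 
ISO-8859-15 bytes (connection details and column names are made up):

  use DBI;
  use XML::LibXML;    # for encodeToUTF8

  my $dbh = DBI->connect('dbi:Pg:dbname=shop', 'axkit', 'secret');
  my $sth = $dbh->prepare('SELECT name, price FROM products');
  $sth->execute;

  while (my $row = $sth->fetchrow_hashref) {
      # assumed: the DB returns ISO-8859-15 - convert every column once,
      # here, and nowhere else
      $_ = XML::LibXML::encodeToUTF8('iso-8859-15', $_) for values %$row;
      # $row->{name} and $row->{price} are now UTF-8
  }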

> UTF-8 [*]. I may be wrong on the form param issue, but it seems to me that

Form params are unfortunately returned in something else. I have not read the 
relevant RFCs, but a quick test showed me that even Mozilla, renowned for its 
standards conformance, returns form data in a local charset (ISO-8859-15 for 
me). I did not check whether the form data follows the page encoding or 
something else. But if I recall correctly, only <form method="post" 
enctype="multipart/form-data"> (the PHP killer :-) has room for explicitly 
specifying an encoding somewhere. Unfortunately I know of no way to get at 
this data through Apache::Request - headers are only accessible for file 
uploads. Ultimately, the evaluation and conversion should be done 
transparently by that module, but it isn't yet.
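
Until that happens, the workaround is to convert by hand while pulling the 
parameters out. A sketch, assuming the browser submitted in ISO-8859-15 (that 
assumption is exactly the ugly part) and $r is the usual mod_perl request 
object:

  use Apache::Request;
  use XML::LibXML;    # for encodeToUTF8

  my $apr = Apache::Request->new($r);

  # convert every parameter to UTF-8 the moment it enters Perl space
  my %param;
  for my $name ($apr->param) {
      $param{$name} = XML::LibXML::encodeToUTF8('iso-8859-15',
                                                 scalar $apr->param($name));
  }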

> Anyway, I do generally think it's a bad idea, but I'm not the definitive

It is an outright bug we would implement. Thus it IS a bad idea.

> [*] When I say UTF-8 here, I really mean Perl's internal representation,
> which is supposed to be transparent to us and just be "unicode" or
> "characters", but in reality we know that's not realistic and that modules
> must turn stuff into UTF-8 to be processable by Perl.

This disclaimer applies to my elaborations as well.

-- 
CU
        Joerg

PGP Public Key at http://ich.bin.kein.hoschi.de/~trouble/public_key.asc
PGP Key fingerprint = D34F 57C4 99D8 8F16 E16E  7779 CDDC 41A4 4C48 6F94

