Hi Randal.

Randal L. Schwartz wrote:
Getting really frustrated with mod_perl2's apparent inability to
properly read UTF8 input.

Here's my mod_perl2 setup:

  Apache 2.2.[something]
  mod_perl 2.0.7 (or nearly that)
  ModPerl::Registry
  Perl "script" with CGI.pm

Very early in my app:

  ## ensure utf8 CGI params:
  $CGI::PARAM_UTF8 = 1;

  binmode STDIN, ":utf8";
  binmode STDOUT, ":utf8";
  binmode STDERR, ":utf8";

This works fine in CGI mode: when I ask for $foo = $cgi->param('foo'),
DBI::data_string_desc($foo) shows a UTF8 string with the proper
discrepancy between bytes and chars.

But when I try to run it under mod_perl, the returned string appears
to be the raw ascii bytes, and definitely not utf8.  Of course, when I
store that in the database (using DBD::Pg), the "latin-1" is encoded
to "utf-8", and I get a bunch of weird chars on the output.

Has anyone managed to round-trip UTF8 from form to database and back
using a setup similar to this?

I suspect part of the problem is this in CGI.pm:

    'read_from_client' => <<'END_OF_FUNC',
    # Read data from a file handle
    sub read_from_client {
    my($self, $buff, $len, $offset) = @_;
    local $^W=0;                # prevent a warning
    return $MOD_PERL
        ? $self->r->read($$buff, $len, $offset)
            : read(\*STDIN, $$buff, $len, $offset);
    }
    END_OF_FUNC

Since I binmode STDIN, the non-$MOD_PERL works ok here.  What's the
equivalent of $r->read() that marks the incoming stream as UTF8, so I
get chars instead of bytes?  Or can I just read(\*STDIN) in mod_perl2
as well? (I know that was supported at one point...)




I share your frustration, as I have been dealing for a long time with multi-lingual web applications, using perl and mod_perl.

First, a very top-level comment: the basic problem here is the incompleteness of the HTTP RFCs and the lack of proper support for international character sets, even today. When a browser POSTs the contents of the <input> elements of a <form> to a server, a set of arcane rules determines, in principle, the character set in which this content is encoded. The problem is that these rules are often confusing, and in addition regularly flouted by different browser makes and versions (not to mention the umpteen proprietary non-browser HTTP clients).

For example, when a browser sends the content of a form with the "multipart/form-data" enctype, the content of each form parameter is sent as a separate section, in a form similar to the parts of a multi-part MIME email. In theory, each of these parts should have its own "Content-Type" header, and if it is text, that header should also carry a "charset" attribute indicating the encoding of the corresponding data. (And if it doesn't, then by virtue of the HTTP RFCs it should be ISO-8859-1, which is still the default HTTP character set today; quite ridiculous, but so it is.)

But the sad reality is that browsers don't do that, so in practice the server-side application is in many cases reduced to "guessing".

From experience more than from definite knowledge of the code, I have to suppose that this kind of confusion sometimes also hits the developers of modules such as CGI.pm and mod_perl, so that over the years things have tended to vary from one version to another (versions of browsers, of perl, of mod_perl, of CGI.pm). Maybe also because of all the reasons above, there is just no "right" way of handling this, so CGI.pm returns "bytes" by default (and libapreq2 may do things differently).

In the end, rather than trying to follow the latest developments all the time and continuously patching my programs because of all this, I have resorted to some "defensive programming" techniques for interpreting <form>-posted data, which have been working fine for me for the last few years. They may well be total overkill, but in practice they have saved me a lot of time otherwise spent wondering why the data in some application suddenly started to show up as "A tilde" followed by some bizarre graphic sign (or, conversely, as a question mark embedded in a lozenge).

(Even logging this stuff and trying to figure out what is going on is a pain, because you first have to figure out in what encoding you are logging, and then in what encoding you are viewing your logs.)
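
One way to take the log encoding out of the equation is to write suspect strings to the log as plain ASCII with hex escapes. The following is only a sketch: the helper name ascii_dump and its output format are invented for illustration, and backslashes or quotes already present in the data are not escaped.

```perl
use strict;
use warnings;

# Hypothetical helper (name and output format are illustrative only):
# render a string using printable ASCII alone, so that neither the log's
# encoding nor the log viewer's encoding can distort what was received.
sub ascii_dump {
    my ($s) = @_;
    return '(undef)' unless defined $s;

    # Replace everything outside printable ASCII by a \x{...} escape.
    (my $safe = $s) =~ s/([^\x20-\x7E])/sprintf('\\x{%X}', ord $1)/ge;

    # Also report Perl's internal utf8 flag and the length as Perl sees it.
    return sprintf('utf8=%d len=%d "%s"',
                   utf8::is_utf8($s) ? 1 : 0, length $s, $safe);
}
```

A line such as warn ascii_dump($value) then shows at a glance whether the value arrived as UTF-8 bytes (pairs of escapes like \x{C3}\x{BC}) or as single characters.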

The methodology I follow is as follows :

1) all HTML <form> pages of the application should have a tag like:
<meta http-equiv="Content-Type" content="text/html; charset=.....">
2) all <form> elements in the page should have the attributes
enctype="multipart/form-data"
accept-charset="....." (the same charset as above)

The above 2 things do not really guarantee anything, but at least they establish some "baseline" which helps in interpreting the rest (and slapping users when they change their browser settings).

3) all forms contain a hidden <input> like
<input type="hidden" name="my-UTF8-check" value="AÜÖ.."> (some known sequence of characters with diacritics, guaranteed to have a different byte length in ISO-8859-x than in UTF-8)

The point of this one is :
- all "your" forms have this parameter, so when you receive some posted data, you can reasonably assume that it is one of "your" forms that sent it.
- if the browser sends the data in iso-8859-1, this string will be a certain length in bytes, and similarly for UTF-8. You can measure that length in a "use bytes;" section of the cgi-bin script. And you can also just compare this with some carefully-crafted string constant.
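
As a rough illustration of the "compare with a string constant" variant, the check could look like the sketch below. It assumes the hidden field holds exactly "AÜÖ" and that the value is still the raw bytes received from the client; the helper name guess_form_encoding is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The check string, written with escapes so that the encoding of this
# source file does not matter: "A", U-umlaut, O-umlaut.
# (Assumption: the hidden form field contains exactly these characters.)
my $CHECK_CHARS = "A\x{DC}\x{D6}";

# The byte sequences expected on the wire for each candidate encoding.
my %EXPECTED = (
    'UTF-8'      => encode('UTF-8',      $CHECK_CHARS),  # 5 bytes
    'ISO-8859-1' => encode('ISO-8859-1', $CHECK_CHARS),  # 3 bytes
);

# Hypothetical helper: takes the raw bytes of the hidden parameter and
# returns the name of the matching encoding, or undef if none matches.
sub guess_form_encoding {
    my ($raw) = @_;
    return unless defined $raw;
    for my $enc (sort keys %EXPECTED) {
        return $enc if $raw eq $EXPECTED{$enc};
    }
    return;
}
```

An undef result means the parameter was missing or mangled, which is itself a useful hint that the request did not come from one of "your" forms.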

Then, on the server side, I have some code which systematically checks which encoding is *really* seen by the program (cgi-bin script or mod_perl module) for these form input elements (using various clues from the server configuration, and the hidden form parameter received as described above). And when this code "knows" the received encoding, it then systematically "sets" or not the perl "utf8" flag for these received cgi->param("x") values before actually using them (or encodes/decodes them as appropriate). The point here being that the rest of your script can assume that all the param values are UTF-8 encoded, and known as such by Perl; and be done with it all.
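
A minimal sketch of that normalisation step, using Encode::decode rather than flipping the flag by hand, might look as follows. The helper name decode_params is hypothetical, the encoding is assumed to have been determined beforehand, and only single-valued parameters are handled. Treating an already-set utf8 flag as "already decoded" is a heuristic, not a guarantee.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: $raw_params is a hash ref of parameter values as
# received, $encoding is the encoding concluded for this request
# (for example 'UTF-8' or 'ISO-8859-1').
sub decode_params {
    my ($raw_params, $encoding) = @_;
    my %decoded;
    for my $name (keys %$raw_params) {
        my $value = $raw_params->{$name};
        # Heuristic: if Perl already flags the value as characters,
        # assume some layer decoded it and leave it alone; otherwise
        # decode the raw bytes into a Perl character string.
        $decoded{$name} = utf8::is_utf8($value)
            ? $value
            : decode($encoding, $value);
    }
    return \%decoded;
}
```

With CGI.pm, the input hash could be built with something like my %raw = map { $_ => scalar $cgi->param($_) } $cgi->param; and the rest of the script then reads only from the decoded copy.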

I'm not saying that this is the cleverest and most elegant and most efficient way of dealing with this, nor that it is the answer you were looking for.
But it's helped me sleep better for quite a while now.
