Re: mod_perl and utf8 and CGI->param

André Warnier Wed, 03 Sep 2014 02:17:58 -0700

Hi Randal.

Randal L. Schwartz wrote:

Getting really frustrated with mod_perl2's apparent inability to
properly read UTF8 input.


Here's my mod_perl2 setup:

  Apache 2.2.[something]
  mod_perl 2.0.7 (or nearly that)
  ModPerl::Registry
  Perl "script" with CGI.pm

Very early in my app:

  ## ensure utf8 CGI params:
  $CGI::PARAM_UTF8 = 1;

  binmode STDIN, ":utf8";
  binmode STDOUT, ":utf8";
  binmode STDERR, ":utf8";

This works fine in CGI mode: when I ask for $foo = $cgi->param('foo'),
DBI::data_string_desc($foo) shows a UTF8 string with the proper
discrepancy between bytes and chars.

But when I try to run it under mod_perl, the returned string appears
to be the raw ascii bytes, and definitely not utf8.  Of course, when I
store that in the database (using DBD::Pg), the "latin-1" is encoded
to "utf-8", and I get a bunch of weird chars on the output.

Has anyone managed to round-trip UTF8 from form to database and back
using a setup similar to this?

I suspect part of the problem is this in CGI.pm:

    'read_from_client' => <<'END_OF_FUNC',
    # Read data from a file handle
    sub read_from_client {
    my($self, $buff, $len, $offset) = @_;
    local $^W=0;                # prevent a warning
    return $MOD_PERL
        ? $self->r->read($$buff, $len, $offset)
            : read(\*STDIN, $$buff, $len, $offset);
    }
    END_OF_FUNC

Since I binmode STDIN, the non-$MOD_PERL works ok here.  What's the
equivalent of $r->read() that marks the incoming stream as UTF8, so I
get chars instead of bytes?  Or can I just read(\*STDIN) in mod_perl2
as well? (I know that was supported at one point...)

I share your frustration, as I have been dealing with multilingual web applications in perl and mod_perl for a long time.

First, a very top-level comment: the basic problem here is the incompleteness of the HTTP RFCs, and the lack of proper support for international character sets, even today. When a browser POSTs the contents of the <input> elements of a <form> to a server, there is a set of rules which, in principle, determine the character set in which this content is encoded. The problem is that these rules are arcane, often confusing, and in addition regularly flouted by different browser makes and versions (not to mention the umpteen proprietary non-browser HTTP clients).

For example, when a browser sends the content of a form with the "multipart/form-data" enctype, the content of each form parameter is sent as a separate section, in a form similar to the parts of a multipart (MIME) email. In theory, each of these parts should have its own "Content-Type" header, and if it is text, that header should also carry a "charset" attribute indicating the encoding of the corresponding data. (And if it doesn't, by virtue of the HTTP RFCs, it should be ISO-8859-1, which is still the default HTTP character set today; quite ridiculous, but so it is.)

But the sad reality is that browsers don't do that, so in practice the server-side application is, in many cases, reduced to guessing.

From experience more than from definite knowledge of the code, I have to suppose that this kind of confusion sometimes also hits the developers of modules such as CGI.pm and mod_perl, so that over the years things have tended to vary from one version to another (versions of browsers, versions of perl, versions of mod_perl, versions of CGI.pm). Maybe also, because of all the reasons above, there is just no "right" way of handling this, so CGI.pm always returns "bytes" (and libapreq2 may do things differently).

In the end, rather than trying to follow the latest developments all the time and continuously patch my programs because of all this, I have resorted to some "defensive programming" techniques for interpreting <form>-posted data, which have been working fine for me for the last few years. It may well be that they are total overkill, but in practice they have saved me a lot of time not spent wondering why the data in some application suddenly started to show up as "A tilde" followed by some bizarre graphic sign (or, conversely, as a question mark embedded in a lozenge).

(Even logging this stuff and trying to figure out what is going on is a pain, because you have to figure out first in what encoding you are logging, and second in what encoding you are viewing your logs.)


The methodology I follow is this:

1) all html <form> pages of the application should have a tag like:
<meta http-equiv="Content-Type" content="text/html; charset=.....">
2) all <form> elements in the page should have the attributes
enctype="multipart/form-data"
accept-charset="....." (the same as above)

The above 2 things do not really guarantee anything, but at least they establish some "baseline" which helps in interpreting the rest (and slapping users when they change their browser settings).


3) all forms contain a hidden text <input> like

<input type="hidden" name="my-UTF8-check" value="AÜÖ.."> (some known sequence of "diacritics" characters guaranteed to have a different byte length between ISO-8859-x and UTF-8 encoding)
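To make points 1) to 3) concrete, here is a rough sketch of a sub that builds such a page. It is only an illustration under my own assumptions: the sub name form_page, the "/app" action, the "foo" field and the three-character check value are all made up for the example; only the "my-UTF8-check" field name comes from the description above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical check string: A-umlaut, U-umlaut, O-umlaut. Each of these
# takes one byte in ISO-8859-1 and two bytes in UTF-8.
my $CHECK = "\x{C4}\x{DC}\x{D6}";

# Build the page as a character string and encode it once, at the end,
# into the same charset that the page itself declares.
sub form_page {
    my ($charset, $action) = @_;
    my $html = <<"END_HTML";
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=$charset">
</head>
<body>
<form method="post" action="$action"
      enctype="multipart/form-data" accept-charset="$charset">
<input type="hidden" name="my-UTF8-check" value="$CHECK">
<input type="text" name="foo">
<input type="submit">
</form>
</body>
</html>
END_HTML
    return encode($charset, $html);
}

# Usage: octets ready to be sent as the response body.
my $body = form_page('UTF-8', '/app');
```

The value returned is a byte string, so it should be printed to a handle without an encoding layer, or it will be encoded a second time.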


The point of this one is:

- all "your" forms have this parameter, so when you receive some posted data, you can reasonably assume that it is one of "your" forms that sent it.
- if the browser sends the data in iso-8859-1, this string will be a certain length in bytes, and similarly for UTF-8. You can measure that length in a "use bytes;" section of the cgi-bin script. And you can also just compare this with some carefully-crafted string constant.
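The comparison against a string constant could look roughly like the sketch below. The sub name guess_form_encoding, the return labels and the three-character check value (A-umlaut, U-umlaut, O-umlaut) are assumptions made for the example, not anything CGI.pm or mod_perl provides. It also covers the case where something upstream has already decoded the value into characters.

```perl
use strict;
use warnings;

# Hypothetical check value, written out as it would look in each form.
my $CHECK_CHARS  = "\x{C4}\x{DC}\x{D6}";          # 3 characters
my $CHECK_LATIN1 = "\xC4\xDC\xD6";                # 3 octets
my $CHECK_UTF8   = "\xC3\x84\xC3\x9C\xC3\x96";    # 6 octets

# Returns 'utf-8', 'iso-8859-1', 'decoded' or 'unknown' for the value
# received in the hidden check parameter.
sub guess_form_encoding {
    my ($value) = @_;
    return 'unknown' unless defined $value;

    # Perl's utf8 flag is on: the value was already decoded to characters.
    if (utf8::is_utf8($value)) {
        return $value eq $CHECK_CHARS ? 'decoded' : 'unknown';
    }

    # Flag off: treat the value as raw octets and compare byte for byte.
    return 'utf-8'      if $value eq $CHECK_UTF8;
    return 'iso-8859-1' if $value eq $CHECK_LATIN1;
    return 'unknown';
}
```

Anything that comes back as 'unknown' most likely did not come from one of "your" forms, or was mangled on the way, and deserves a log entry.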

Then, on the server side, I have some code which systematically checks which encoding is *really* seen by the program (cgi-bin script or mod_perl module) for these form input elements (using various clues from the server configuration, and the hidden form parameter received above).

And when this code "knows" the received encoding, it then systematically "sets" (or not) the perl "utf8" flag for these received cgi->param("x") values before actually using them (or encodes/decodes them as appropriate).

The point here is that the rest of your script can assume that all the param values are UTF-8 encoded, and known as such by Perl; and be done with it all.
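As an illustration of that last step, here is a minimal sketch that uses the Encode module to turn the raw values into Perl character strings once the encoding is known. The sub name normalize_params and the encoding labels ('utf-8', 'iso-8859-1', 'decoded') are assumptions for the example. It takes a plain hash of parameter values, so it does not depend on how they were read.

```perl
use strict;
use warnings;
use Encode qw(decode);

# $params   : hashref of parameter name => value as received
# $encoding : 'utf-8', 'iso-8859-1' or 'decoded' (the result of whatever
#             detection was done beforehand)
# Returns a new hashref whose values are all Perl character strings.
sub normalize_params {
    my ($params, $encoding) = @_;
    my %out;
    for my $name (keys %$params) {
        my $value = $params->{$name};
        if (!defined $value
            || $encoding eq 'decoded'
            || utf8::is_utf8($value)) {
            # Nothing to do: undefined, or already a character string.
            $out{$name} = $value;
        }
        else {
            my $enc = $encoding eq 'utf-8' ? 'UTF-8' : 'ISO-8859-1';
            $out{$name} = decode($enc, $value);
        }
    }
    return \%out;
}

# Usage: "\xC3\xA9" is e-acute as two UTF-8 octets.
my $clean = normalize_params({ foo => "\xC3\xA9" }, 'utf-8');
```

Decoding rather than flipping the utf8 flag by hand means malformed input gets replaced with a substitution character instead of producing a corrupt string.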

I'm not saying that this is the cleverest and most elegant and most efficient way of dealing with this, nor that it is the answer you were looking for.

But it's helped me sleep better for quite a while now.
