Re: mixing of utf8 and latin1 in URI->query_form

Bill Moseley Mon, 05 Mar 2012 02:32:36 -0800

On Mon, Mar 5, 2012 at 3:54 PM, Tim Brody <t...@ecs.soton.ac.uk> wrote:

> On Sat, 2012-03-03 at 15:25 +0700, Bill Moseley wrote:
> > I use HTTP::Request::Common to build an application/x-www-form-urlencoded
> > POST from a passed-in hash.  The hash contains strings as values.
> >
> > $req = POST '/foo', \%parameters;
> >
> >
> <snip>
> >
> > The thing to notice here is how the encoding for $latin1 changed just
> > because of the addition into the hash of the $unicode string.  Things
> > thus break when the server tries to decode the query parameters on the
> > server side if it assumes either latin1 or utf8 encoding.
>

I'm sorry, I made a mistake in my example.  I meant:

 $req = POST '/foo', Content => \%parameters;

It's the request *body* that I'm talking about, not the query parameters.

I understand this is a known issue in URI:

           The escaping (percent encoding) of chars in the 128 .. 255 range
           passed to the URI constructor or when setting URI parts using the
           accessor methods depend on the state of the internal UTF8 flag
           (see utf8::is_utf8) of the string passed.  *If the UTF8 flag is
           set the UTF-8 encoded version of the character is percent
           encoded.  If the UTF8 flag isn't set the Latin-1 version (byte)
           of the character is percent encoded.  This basically exposes the
           internal encoding of Perl strings.*

And because the same character string can be represented either without
(latin1) or with (utf8) the UTF8 flag, and Perl can upgrade character
strings from latin1 to utf8 without my knowing, I cannot be sure exactly
what percent encoding will be used.  It's not really a bug; rather, the
resulting percent encoding is just not predictable from the caller's side.

That is, take the two hashes below, both passed to query_form.  The second
differs only by the addition of the "unicode" key, whose value has the UTF8
flag set, yet that addition changes the percent encoding of the "latin1"
value.

{ ascii => $ascii, latin1 => $latin1 }
{ ascii => $ascii, latin1 => $latin1, unicode => $unicode }
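To make the effect concrete, here is a minimal core-only sketch.  The sample
values and the escape_internal helper are hypothetical (the original values
were snipped from the thread), and the helper only imitates the behaviour the
URI documentation describes; it is not URI's actual code.

```perl
use strict;
use warnings;

# Hypothetical sample values, not the ones from the original post.
my $ascii   = "abc";
my $latin1  = "caf\xE9";     # e-acute held as a single byte, UTF8 flag off
my $unicode = "\x{263A}";    # code point above 255, so the UTF8 flag is on

# Percent-encode whatever octets Perl holds internally for the string.
# This imitates the documented behaviour; it is not URI's implementation.
sub escape_internal {
    my ($s) = @_;
    utf8::encode($s) if utf8::is_utf8($s);    # expose the internal UTF-8 octets
    $s =~ s/([^A-Za-z0-9\-._~=&])/sprintf("%%%02X", ord($1))/ge;
    return $s;
}

# Joining with a flagged string silently upgrades the whole result.
my $without = join '&', "ascii=$ascii", "latin1=$latin1";
my $with    = join '&', "ascii=$ascii", "latin1=$latin1", "unicode=$unicode";

print escape_internal($without), "\n";   # ascii=abc&latin1=caf%E9
print escape_internal($with), "\n";      # ascii=abc&latin1=caf%C3%A9&unicode=%E2%98%BA
```

The "latin1" value goes from %E9 to %C3%A9 only because a flagged string was
joined into the same buffer.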

Gisle, if I were to override query_form, do you see any problems with either
of these approaches to make sure that the final percent encoding is always
of utf8-encoded octets?  One is to explicitly encode everything to utf8
first:

use Encode qw(encode_utf8);
use URI::Escape qw(uri_escape);
my %encoded_params = map { uri_escape( encode_utf8($_) ) }  %{$params};
my $query = join '&', map { "$_=$encoded_params{$_}" } keys %encoded_params;

The other approach would be to utf8::upgrade each value (and key) in the
hash and let URI's query_form work as-is.
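As a rough check of the first approach, the sketch below encodes every key and
value to UTF-8 before escaping.  It uses only core modules; escape_octets and
utf8_query are hypothetical helpers standing in for uri_escape and an
overridden query_form, and the sorted key order is an assumption made only to
keep the output deterministic.  (URI::Escape also exports uri_escape_utf8,
which combines the encode and escape steps.)

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Core-only stand-in for uri_escape: percent-encode a string of octets.
sub escape_octets {
    my ($octets) = @_;
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf("%%%02X", ord($1))/ge;
    return $octets;
}

# Hypothetical helper: encode keys and values to UTF-8 first, so the result
# no longer depends on the UTF8 flag of any input string.
sub utf8_query {
    my ($params) = @_;
    return join '&',
        map {
            escape_octets(encode_utf8($_)) . '='
              . escape_octets(encode_utf8($params->{$_}))
        }
        sort keys %$params;
}

my $latin1   = "caf\xE9";    # UTF8 flag off
my $upgraded = $latin1;
utf8::upgrade($upgraded);    # same characters, UTF8 flag on

print utf8_query({ latin1 => $latin1 }),   "\n";   # latin1=caf%C3%A9
print utf8_query({ latin1 => $upgraded }), "\n";   # latin1=caf%C3%A9
```

This assumes every value is a character string.  A value that already holds
UTF-8 encoded octets would be encoded a second time.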

-- 
Bill Moseley
mose...@hank.org
