I use HTTP::Request::Common to build an application/x-www-form-urlencoded POST from a passed-in hash. The hash contains strings as values.
    $req = POST '/foo', \%parameters;

This uses URI->query_form to build the url-encoded body, and sets the Content-Type to application/x-www-form-urlencoded (without any charset).

The problem I have is that the %parameters hash contains valid Perl character strings, but the resulting url-encoded request differs depending on the mix of Perl strings -- by mix I mean strings with and without Perl's utf8 flag. The resulting url-encoded request ends up as either latin1 or utf8 url-encoded octets. It's not easy to know what charset to add to the request, and likewise, things break if the server handling the request assumes it's a utf8 url-encoded request.

Perhaps some code might help. Consider these three very normal, valid, *character* strings in Perl:

    my $ascii   = 'Hello';
    my $latin1  = 'Ue: ' . chr(220);
    my $unicode = "Happy \x{263A}";

What you would expect is that only $unicode would have Perl's utf8 flag set. And indeed that is true:

    print_var($_) for $ascii, $latin1, $unicode;

    str [Hello] with flag: NO
    str [Ue: Ü] with flag: NO
    str [Happy ☺] with flag: YES

And if the strings are concatenated, Perl will utf8::upgrade $latin1 -- you can see that happened because the umlaut survived the trip from latin1 to utf8:

    print_var( "Joined = '$ascii : $latin1 : $unicode'" );

    str [Joined = 'Hello : Ue: Ü : Happy ☺'] with flag: YES

Those three strings are perfectly fine Perl character strings, and they could be combined into a hash and fed to query_form(). This code simply passes a hashref to $uri->query_form and then prints $uri->query:

    print_query( { ascii => $ascii } );
    print_query( { ascii => $ascii, latin1 => $latin1 } );
    print_query( { ascii => $ascii, latin1 => $latin1, unicode => $unicode } );

    URI query_form = str [ascii=Hello] with flag: NO
    URI query_form = str [ascii=Hello&*latin1=Ue%3A+%DC*] with flag: NO
    URI query_form = str [ascii=Hello&unicode=Happy+%E2%98%BA&*latin1=Ue%3A+%C3%9C*] with flag: NO

The thing to notice here is how the encoding of $latin1 changed (%DC became %C3%9C) just because the $unicode string was added to the hash. Things thus break when the server tries to decode the query parameters, whether it assumes latin1 or utf8 encoding.

The problem is I have code that accepts a hash and passes it directly to POST. But if there happens to be a latin1 string in there, then the request changes depending on whether there's also a string with the utf8 flag set. Am I missing something here? It seems that if query_form is passed a hash, the resulting encoding should not change based on what else is in that hash.

I can think of two solutions. One would be to build the query string a different way, by explicitly encoding to utf8 first:

    use Encode qw(encode_utf8);
    use URI::Escape qw(uri_escape);

    my %encoded_params = map { uri_escape( encode_utf8($_) ) } %{$params};
    my $query = join '&', map { "$_=$encoded_params{$_}" } keys %encoded_params;

Another way would be to explicitly utf8::upgrade every key and value in the hash before query_form() does its work. Obviously, that would break anyone who is only using latin1 strings and assuming a latin1 url-encoded request body.

My ugly test script is here: http://hank.org/utf8post.pl

-- 
Bill Moseley
mose...@hank.org
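
P.S. In case it saves a trip to the test script, here is roughly what the print_var and print_query helpers used above look like. This is a sketch reconstructed from the output shown, not the exact code in utf8post.pl:

    use URI;

    # Print a string plus whether its internal utf8 flag is on.
    sub print_var {
        my ($str) = @_;
        printf "str [%s] with flag: %s\n",
            $str, utf8::is_utf8($str) ? 'YES' : 'NO';
    }

    # Pass a hashref to query_form and show the resulting query string.
    sub print_query {
        my ($params) = @_;
        my $uri = URI->new('http://example.com/');
        $uri->query_form($params);
        my $query = $uri->query;
        printf "URI query_form = str [%s] with flag: %s\n",
            $query, utf8::is_utf8($query) ? 'YES' : 'NO';
    }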
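
For the first solution, once the body is explicitly utf8-encoded, the request could also declare the charset so the server doesn't have to guess. A sketch, using a made-up http://example.com/foo endpoint and the $query built above (whether servers actually honor a charset parameter on form posts is another question):

    use HTTP::Request;

    # Send the explicitly utf8-encoded body and label it as such.
    my $req = HTTP::Request->new( POST => 'http://example.com/foo' );
    $req->header(
        'Content-Type' => 'application/x-www-form-urlencoded; charset=UTF-8' );
    $req->content($query);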
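
And a sketch of the second solution against the same made-up endpoint -- upgrade copies of every key and value so the internal representation is uniform before query_form() sees any of them:

    use HTTP::Request::Common qw(POST);

    my %upgraded;
    while ( my ($k, $v) = each %{$params} ) {
        utf8::upgrade($k);    # $k and $v are copies; originals untouched
        utf8::upgrade($v);
        $upgraded{$k} = $v;
    }
    my $req = POST 'http://example.com/foo', \%upgraded;

This makes the %C3%9C-style escaping consistent regardless of the mix, at the cost described above for callers expecting latin1 octets.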