Paul Kulchenko <[EMAIL PROTECTED]> writes:

> Hi, Gisle!

Hi, Paul!

> --- Gisle Aas <[EMAIL PROTECTED]> wrote:
> > In my view it is a bug to put content containing chars with ord() > 255
> > in the content of an HTTP::Request.  If you want UTF8 encoded stuff
> > you should put UTF8 encoded stuff in the content.  Don't expect perl
> > to magically guess.  You should use Encode::encode_utf8($str) or
> > something like it.
>
> I'm not sure I follow you. I do have my string utf8 encoded using
> Perl capabilities. What do you mean "is a bug to put content
> containing chars with ord() > 255 in the content of a
> HTTP::Request"? What should I do then if I have a utf8 encoded string
> to send? I expect LWP to handle it properly on the wire.

If you have a utf8 encoded string then none of the chars in the string
will have ord() > 255.  It is the kind of string that
Encode::encode_utf8() would produce.
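
To make the distinction concrete, here is a minimal sketch, assuming a
perl that ships the Encode module (5.8 or later).  The sample string is
made up for illustration:

```perl
use strict;
use warnings;
use Encode ();
use List::Util qw(max);

# Character string with one code point above 255 (U+263A); sample data only.
my $chars = "caf\x{e9} \x{263A}";

# Explicitly serialize the characters to UTF-8 octets.
my $octets = Encode::encode_utf8($chars);

# Largest ord() found in each string.
my $max_char  = max(map { ord } split //, $chars);
my $max_octet = max(map { ord } split //, $octets);

print "character string: length ", length($chars), ", largest ord $max_char\n";
print "encoded octets:   length ", length($octets), ", largest ord $max_octet\n";
```

The encoded string is longer than the character string, but no element
of it is above 255, so it can go on the wire as is.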

What you want is for LWP to deal with strings with the _internal_ UTF8
flag set.  In my view LWP can't guess what encoding to apply to
serialize that kind of string.  UTF-8 is not really a more obvious
choice than UTF-16.  If it happens to be an image that somehow got the
UTF8 flag set then any UTF-encoding would be wrong, as the string
should simply be utf8_downgraded to be ok again.
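
A rough sketch of the image case, assuming perl 5.8 or later for the
utf8:: functions and Encode.  The five bytes are a hypothetical payload,
not a real image:

```perl
use strict;
use warnings;
use Encode ();

# Five arbitrary bytes standing in for image data (hypothetical payload).
my $binary = join '', map { chr } 0x89, 0x50, 0x4E, 0x47, 0xFF;

# Same characters, but stored with the internal UTF8 flag set.
my $flagged = $binary;
utf8::upgrade($flagged);

# Either encoding changes the payload, and the two disagree with each other.
my $as_utf8  = Encode::encode_utf8($flagged);
my $as_utf16 = Encode::encode('UTF-16BE', $flagged);

# Downgrading only changes the internal representation; it dies if the
# string holds a character above 255.
utf8::downgrade($flagged);

print "original bytes:  ", length($binary), "\n";
print "UTF-8 encoded:   ", length($as_utf8), "\n";
print "UTF-16BE:        ", length($as_utf16), "\n";
print "after downgrade: ", ($flagged eq $binary ? "identical" : "different"), "\n";
```

Only the downgraded string has the same octets as the original data,
which is why guessing an encoding inside LWP seems wrong to me.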

> > If there was an easy way I would like to add a
> >   sv_utf8_downgrade($req->content, 0);
> > to the LWP::Protocol code.  This would make requests with such
> > chars in them fail early.  
> I don't understand why they should fail. What's wrong with this:
> 
> $utf8 = pack('U*', unpack('C*', $something_russian_latin1_encoded));
> $req = HTTP::Request
>   ->new(POST => $endpoint, HTTP::Headers->new, $utf8);
> $resp = LWP::UserAgent->new->request($req);
> 
> The request won't be properly encoded in 5.6.1 and later unless you drop
> the utf8 mark from $utf8. I do need to have utf8 encoding on the wire.
> 
> What do you expect me to do?

If you want the string UTF8 encoded, then say so explicitly:

 $req = HTTP::Request->new(POST => $endpoint);
 $req->content_type("text/plain; charset=UTF-8");
 $req->content(Encode::encode_utf8($utf8));

> > > # drop UTF mark
> > > $str = pack('C0A*', $str) if length($str) != bytelength($str);
> > > 
> > > Ideally I would like to have it fixed in LWP::Protocol. btw, how
> > > quick is pack 'C0A*'?
> > 
> > It will certainly have to copy the string.  I think it would be
> > better to try to use one of the functions the Encode module provides.
>
> I need it to work with all Perls starting with 5.005. Encode wasn't
> available in 5.6.x, was it?

That is true.  How about something like this (untested):

   eval { require Encode; };
   if ($@) {
      # Encode could not be loaded (as on 5.6.x): install a
      # replacement that drops the UTF8 mark with pack
      *Encode::encode_utf8 = sub { pack('C0A*', shift) };
   }

Actually, I think pack is buggy if this works.  The A* really ought to
downgrade the string it packs and croak if this is not possible.
   
Regards,
Gisle
