Paul Kulchenko <[EMAIL PROTECTED]> writes:

> [Problem]
> 1. LWP::Protocol::http and others use length() to calculate
> content-length and in Perl 5.6.1 and later length() calculates chars
> instead of bytes. It means that every request that has multibyte
> chars in it will have wrong content-length and other side will read
> less bytes than required. 

In my view it is a bug to put content containing chars with ord() > 255
in the content of an HTTP::Request.  If you want UTF8 encoded stuff
you should put UTF8 encoded stuff in the content.  Don't expect perl
to magically guess.  You should use Encode::encode_utf8($str) or
something like it.
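
A minimal sketch of the explicit-encoding approach described above, using only the core Encode module. The sample string and variable names are made up for illustration, and the HTTP::Request call appears only in a comment to show where the encoded bytes would go.

```perl
use strict;
use warnings;
use Encode ();

# A character string holding one code point above 255 (U+263A).
my $chars = "smile: \x{263A}";

# Encode explicitly. The result is a byte string in which every
# element has ord() <= 255, so length() now counts bytes.
my $bytes = Encode::encode_utf8($chars);

printf "chars: %d, bytes: %d\n", length($chars), length($bytes);
# prints "chars: 8, bytes: 10"

# $bytes, not $chars, is what would go into the request, e.g.
#   my $req = HTTP::Request->new(POST => $url, $headers, $bytes);
```

With the content encoded up front, a Content-Length computed with length() matches the number of bytes written to the socket.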

If there were an easy way I would like to add a

  sv_utf8_downgrade($req->content, 0);

to the LWP::Protocol code.  This would make requests with such chars
in them fail early.  I think the write call on the socket ought to do
the downgrade and croaking for me though.
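
For illustration: later Perls (5.8 and up) expose this check at the Perl level as utf8::downgrade(). The sketch below assumes such a Perl. The helper name content_is_bytes is hypothetical and is not part of LWP.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the content can be represented as
# plain bytes, i.e. it contains no char with ord() > 255.
sub content_is_bytes {
    my ($content) = @_;    # work on a copy, leave the caller's string alone
    # With a true second argument utf8::downgrade() returns false
    # instead of dying when the string cannot be downgraded.
    return utf8::downgrade($content, 1) ? 1 : 0;
}

# Latin-1 text that happens to be stored as UTF-8 internally.
my $latin1 = "caf\x{E9}";
utf8::upgrade($latin1);

my $ok  = content_is_bytes($latin1);            # 1: fits in bytes
my $bad = content_is_bytes("smile \x{263A}");   # 0: needs a wide char

print "latin1 ok: $ok, wide ok: $bad\n";
```

Calling utf8::downgrade($content) without the second argument dies on wide characters, which gives the "fail early" behaviour.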

> 2. LWP::Protocol::http::request overwrites Content-length header even
> if application properly specifies it. It's easy to fix (below), but
> it doesn't help much, because length() is used in syswrite/sysread
> calls to calculate size. See problem 1.
> 
> [Solution]
> Problem 2 is easy to fix:
> 
>       $h->header('Content-Length' => length $$cont_ref)
>               if defined($$cont_ref) && length($$cont_ref);
> 
> should be
> 
>       $h->header('Content-Length' => length $$cont_ref)
>               if !defined($h->header('Content-Length')) &&
>                    defined($$cont_ref) && length($$cont_ref);

I think overriding Content-Length is the right thing to do.

> Problem 1 is more complex. 'use bytes' is lexically scoped, hence
> doesn't help outside of LWP::Protocol::http. eval "use bytes"; can be
> used to fix it inside LWP::Protocol::http and others.
> 
> On application level the only solution I can come up with is this:
> 
> BEGIN { 
>   sub bytelength; 
>   eval ( eval('use bytes; 1') # 5.6.0 and later?
>     ? 'sub bytelength { use bytes; length(@_ ? $_[0] : $_) }; 1'
>     : 'sub bytelength { length(@_ ? $_[0] : $_) }; 1' 
>   ) or die;
> }
> 
> # drop UTF mark
> $str = pack('C0A*', $str) if length($str) != bytelength($str);
> 
> Ideally I would like to have it fixed in LWP::Protocol. btw, how
> quick is pack 'C0A*'?

It will certainly have to copy the string.  I think it would be better
to try to use one of the functions the Encode module provides.
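
As a rough sketch of what that replacement could look like, assuming the string is a character string with wide chars in it: Encode::encode_utf8() yields the same octets the pack 'C0A*' trick is trying to expose, and it also returns a copy. The sample string is invented for the example.

```perl
use strict;
use warnings;
use Encode ();

# Character string containing U+20AC (euro sign).
my $str = "price: \x{20AC}5";

# Explicit alternative to pack('C0A*', $str): a new byte string.
my $octets = Encode::encode_utf8($str);

# For comparison, the byte count of the internal representation,
# as the quoted bytelength() helper would report it.
my $internal_len = do { use bytes; length($str) };

printf "chars: %d, octets: %d, internal bytes: %d\n",
    length($str), length($octets), $internal_len;
```

Unlike the pack approach, this does not depend on how perl happens to store the string internally. A string that is not UTF-8 internally but contains chars in the 128..255 range still gets encoded to UTF-8.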

Regards,
Gisle
