"Подбельский В.В." <[EMAIL PROTECTED]> writes:

> While creating a bot for LJ I ran into the following error:
> "Parsing of undecoded UTF-8 will give garbage when decoding entities at 
> D:/perl.5.8.8/site/lib/LWP/Protocol.pm line 114."
> 
> Here's the piece of code that reproduces the error:
> ###
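> use strict;
> use warnings;
> use HTTP::Cookies;
> use LWP::UserAgent;
>
> my $lj_url = 'http://www.livejournal.com/';
> my $cj = HTTP::Cookies->new();
> my $ua = LWP::UserAgent->new(agent => 'Howdy?', cookie_jar => $cj);
> # Headers sent with every request made through this agent
> $ua->default_header('Accept-Language' => 'ru, en',
>                     'Accept-Charset'  => 'utf-8;q=1, *;q=0.1',
>                     'Referer'         => $lj_url);
> print "Getting the login form...\n";
> # The warning is emitted while this response is being received
> my $res = $ua->get($lj_url . 'login.bml?nojs=1');
> exit;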
> 
> It's said in HTML::HeadParser's pod that:
> "Note that the HTML::HeadParser might get confused if raw undecoded
> UTF-8 is passed to the parse() method.  Make sure the strings are
> properly decoded before passing them on."
> 
> And the error seems to be on this line in LWP's Protocol.pm:
> 114: $parser->parse($$content) or undef($parser);
> If I change it so the content gets decoded before being parsed:
> $parser->parse(decode_utf8($$content)) or undef($parser);
> the error message goes away.
> 
> Is there something I'm doing wrong, or is it really a bug?

Yes, it's a bug.  The data we feed the $parser here should really be
decoded in a similar way to what the 'decoded_content' method of
HTTP::Message provides.  If you, for instance, send requests with
'Accept-Encoding: gzip' then LWP might end up feeding binary stuff to
the parser.  What LWP needs is to set up some decoding pipeline that
can decode content as it is received in chunks.  I have not gotten
around to it yet :)
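
To make the difference concrete, here is a minimal sketch of what
'decoded_content' does on a complete response.  The response is built by
hand (no network involved), and the sample text and variable names are
only illustrative assumptions.  It does not show the chunk-wise pipeline
described above, just the end result such a pipeline would have to
produce:

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Response;
use IO::Compress::Gzip qw(gzip $GzipError);

# Illustrative sample: an HTML fragment containing Cyrillic characters.
my $text  = "<title>\x{41f}\x{440}\x{438}\x{432}\x{435}\x{442}</title>";
my $bytes = encode('UTF-8', $text);

# Compress the UTF-8 bytes, as a server honouring
# 'Accept-Encoding: gzip' might do.
gzip(\$bytes => \my $gzipped) or die "gzip failed: $GzipError";

my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type'     => 'text/html; charset=UTF-8',
      'Content-Encoding' => 'gzip' ],
    $gzipped,
);

# content() returns the raw (still compressed) bytes;
# decoded_content() undoes the Content-Encoding and then the charset.
my $raw     = $res->content;
my $decoded = $res->decoded_content;
```

Here $raw is the kind of data the head parser can currently end up
seeing, while $decoded is a proper Perl character string.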

A workaround might be to just disable this head-parsing business, with:

   $ua = LWP::UserAgent->new(...., parse_head => 0);

or by calling:

   $ua->parse_head(0);

after the $ua object has been constructed.  The most important
downside is that $response->base might not be accurate, since a
<base href="..."> element in the document head is no longer picked up,
but you might not care about that.

--Gisle