"Подбельский В.В." <[EMAIL PROTECTED]> writes:
> While creating a bot for LJ I encountered the following error:
> "Parsing of undecoded UTF-8 will give garbage when decoding entities at
> D:/perl.5.8.8/site/lib/LWP/Protocol.pm line 114."
>
> Here's the piece of code that reproduces the error:
> ###
> my $lj_url = 'http://www.livejournal.com/';
> my $cj = HTTP::Cookies->new();
> my $ua = LWP::UserAgent->new(agent => 'Howdy?', cookie_jar => $cj);
> $ua->default_header('Accept-Language' => 'ru, en',
>                     'Accept-Charset' => 'utf-8;q=1, *;q=0.1',
>                     'Referer' => $lj_url);
> print "Getting the login form...\n";
> $res = $ua->get($lj_url . 'login.bml?nojs=1');
> exit;
>
> HTML::HeadParser's pod says:
> "Note that the HTML::HeadParser might get confused if raw undecoded
> UTF-8 is passed to the parse() method. Make sure the strings are
> properly decoded before passing them on."
>
> And the error seems to be on this line in LWP's Protocol.pm:
> 114: $parser->parse($$content) or undef($parser);
> If I change it so that the content gets decoded before being parsed:
> $parser->parse(decode_utf8($$content)) or undef($parser);
> the error message goes away.
>
> Am I doing something wrong, or is it really a bug?

Yes, it's a bug. The data we feed the $parser here should really be
decoded in a way similar to what the 'decoded_content' method of
HTTP::Message provides. If, for instance, you send requests with
'Accept-Encoding: gzip', then LWP might end up feeding binary data to
the parser. What LWP needs is to set up a decoding pipeline that can
decode content as it is received in chunks. I have not gotten around
to it yet :)

A workaround might be to just disable this head-parsing business, with:

    $ua = LWP::UserAgent->new(...., parse_head => 0);

or by calling:

    $ua->parse_head(0);

after the $ua object has been constructed. The most important downside
is that $response->base might not be accurate, but you might not care
about that.

--Gisle