Forwarded conversation
Subject: HTML::Element as_HTML encoding problem
------------------------

From: *Oliver Block* <li...@oliver-block.eu>
Date: Wed, Oct 14, 2009 at 9:00 PM
To: lib...@perl.org


Hello everyone,

the following code is used to load a web page from a certain web server
and parse it into an html tree. At the end a variable is assigned the
string representation of that tree.

       use LWP::UserAgent;
       use HTML::TreeBuilder;

       my $ua = LWP::UserAgent->new;
       my $response = $ua->get($form->{'url'});

       my $tree = HTML::TreeBuilder->new();
       $tree->parse($response->content);

       # ...
       # encoding of content of $tree is ISO-8859-1 at this point
       $template = $tree->as_HTML('<>&');

       # encoding of content of $template is UTF-8

Now the following problem arises. The encoding of the content of
$template (UTF-8) is not the same as that of the content of $tree
(ISO-8859-1). So it is obvious that as_HTML converts the encoding to UTF-8.

I debugged everything, and everything is fine up to the last line of code of
sub HTML::Element::as_HTML, which is:

 return join('', @html, "\n");

This would mean that join seems to modify the encoding of the content.

Any suggestions?


Best Regards,

Oliver Block



----------
From: *Bill Moseley* <mose...@hank.org>
Date: Thu, Oct 15, 2009 at 12:55 PM
To: Oliver Block <li...@oliver-block.eu>
Cc: lib...@perl.org


I'm not really sure what the problem is, sorry.  But the terminology above
seems a bit off.

UTF-8 and ISO-8859-1 are encodings (encoded octets), not characters.
Characters are an abstraction.  You should use characters inside Perl and
encoded octets outside.  (Ignore the fact that Perl's internal encoding is
UTF-8 and just pretend they are character abstractions.)
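
As a rough illustration of why the internal storage can be ignored, the
sketch below joins a string of single-byte characters with a string that
holds a character above U+00FF. The variable names are made up for the
example, and this is only one plausible way a join could appear to "change
the encoding"; the thread does not confirm that this is what happens inside
as_HTML.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Four characters, each small enough to be stored as a single byte.
my $narrow = "caf\xE9";

# U+20AC (euro sign) does not fit in one byte, so Perl stores this
# string internally as UTF-8.
my $wide = "\x{20AC}";

# To combine the two, join() upgrades the narrow part internally.
my $joined = join('', $narrow, $wide);

# The characters themselves are unchanged: still five of them, and
# the fourth is still U+00E9.
my $count  = length($joined);
my $fourth = ord(substr($joined, 3, 1));

# Octets only come into play when the string is encoded explicitly.
my $as_utf8 = encode('UTF-8', $joined);

printf "%d characters, fourth is U+%04X, %d UTF-8 octets\n",
    $count, $fourth, length($as_utf8);
```

If a string like $joined is written out without an explicit encode step,
the octets that reach the outside world depend on Perl's internal storage,
which may be why the result looks like it was "converted" to UTF-8.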

So, in general, I would bring character data into Perl like:

my $characters = $response->decoded_content;

Then you work with $characters as needed.

And then when you want to output you convert back to whatever encoding you
need:

use Encode qw(encode_utf8);

my $utf8_octets = encode_utf8( $characters );

send_to_client( $utf8_octets );
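
The decode-on-input, encode-on-output flow above can be tried without a
network connection. This is a minimal sketch, assuming the HTTP::Response
object is built by hand with a Content-Type header that declares
ISO-8859-1; in real code it would come from $ua->get(...), and
send_to_client stays a placeholder as in the message.

```perl
use strict;
use warnings;
use HTTP::Response;
use Encode qw(encode_utf8);

# Hand-built response (an assumption for the example): the body is
# ISO-8859-1 octets, and the header says so.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/plain; charset=ISO-8859-1' ],
    "caf\xE9",
);

# Input boundary: octets become characters.
my $characters = $response->decoded_content;

# Work with characters inside the program.
my $shouted = uc $characters;

# Output boundary: characters become octets in the encoding you want.
my $utf8_octets = encode_utf8($shouted);

printf "%d characters in, %d octets out\n",
    length($characters), length($utf8_octets);
```

The point of the design is that the encoding is chosen exactly twice, once
on the way in and once on the way out, and never in between.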

For your case you might try $tree->parse( $response->decoded_content );  Or,
if you have raw UTF-8 octets that you need to parse, I think you can call
$tree->utf8_mode( 1 ) to tell the parser to decode.  But I'd prefer the
first.
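
A minimal sketch of the first suggestion, assuming the page arrives as
ISO-8859-1 octets. Here the octets are a literal string decoded with
Encode, standing in for $response->decoded_content, and the final
encode to ISO-8859-1 is just one possible choice of output encoding.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use Encode qw(decode encode);

# Stand-in for the fetched page: ISO-8859-1 octets (assumption).
my $latin1_octets = "<html><body><p>caf\xE9</p></body></html>";

# Decode before parsing, so the tree holds characters.
my $characters = decode('ISO-8859-1', $latin1_octets);

my $tree = HTML::TreeBuilder->new;
$tree->parse($characters);
$tree->eof;    # tell the parser that the input is complete

# as_HTML returns a character string; only <, > and & are escaped.
my $html_chars = $tree->as_HTML('<>&');

# Pick the output encoding explicitly at the very end.
my $out = encode('ISO-8859-1', $html_chars);

$tree->delete;    # free the tree when done
```

With this arrangement the encoding of $out is whatever the last encode
call says it is, regardless of how Perl stored the strings in between.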

(One thing I'm not clear on is when, or if, the parsers detect the encoding
by looking for a charset in the content.  XML::LibXML will use the
<?xml encoding= declaration from the content, for example.  But I'm not
clear whether HTML::Parser will look at an encoding set in a <meta> tag.)






--
Bill Moseley
mose...@hank.org

----------
From: *Terrence Brannon* <scheme...@gmail.com>
Date: Thu, Oct 15, 2009 at 1:15 PM
To: seamstress-disc...@lists.sf.net