Oliver Block wrote:
> (You will find the Perl code at the end.)
>
> A close look at the dump of $tree and a comparison with
> $response->content showed the following:
>
> The following markup from $response->content
>
> <td colspan="8" align="left" bgcolor="#FFFFFF" class="Rubrik">»
> Kontakt › Kontaktformular</td>
>
> appears in the tree as
>
> bless( {
>   '_parent' =>
>     $VAR1->{'_content'}[1]{'_content'}[0]{'_content'}[1]{'_content'}[5],
>   '_content' => [
>     "\x{bb} Kontakt \x{a0}\x{203a} Kontaktformular"
>   ],
>   'colspan' => '8',
>   'align' => 'left',
>   'bgcolor' => '#FFFFFF',
>   '_tag' => 'td',
>   'class' => 'Rubrik'
> }, 'HTML::Element' )
>
> If you have any idea how to avoid the conversion to UTF-8 and how to
> ensure that the output of $tree->as_HTML() can be saved in the same
> encoding as stated in $response, please let me know.

I think I've found out what causes the problem.

As I mentioned earlier, the content of a td tag, in my case
"» Kontakt › Kontaktformular", is represented in the tree by the
following characters:

  "\x{bb} Kontakt \x{a0}\x{203a} Kontaktformular"

The reason seems to be that the character › has no representation in the
ISO-8859-1 encoding. Its code point is U+203A, which is above 0xFF. It
still seems to be a legal character in ISO-8859-1-encoded HTML documents
when it appears in the form of a character entity reference.
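To make the point above concrete, here is a minimal sketch that checks which characters of the dumped text node fit into ISO-8859-1. It assumes only the core Encode module; the variable names ($text, @wide, $encodable) are mine and do not come from the original script.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The text node exactly as it appears in the dump of $tree.
my $text = "\x{bb} Kontakt \x{a0}\x{203a} Kontaktformular";

# ISO-8859-1 only covers code points 0x00..0xFF, so collect everything above.
my @wide = grep { ord($_) > 0xFF } split //, $text;
printf "outside ISO-8859-1: U+%04X\n", ord($_) for @wide;

# With FB_CROAK, encode() dies on the first unmappable character.
# It may modify its argument when a CHECK value is given, so work on a copy.
my $copy = $text;
my $encodable = eval { encode('ISO-8859-1', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
print "encodable as ISO-8859-1: $encodable\n";
```

Only \x{203a} is reported; \x{bb} and \x{a0} are ordinary Latin-1 characters.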
So, changing the call from

  $tree->as_HTML('<>&');

to

  $tree->as_HTML();

solves the problem, because now all "unsafe" characters (e.g. "\x{203a}")
are encoded as entities within as_HTML(). Therefore there is no need for
Perl to encode the complete string as UTF-8 when join() is called (see the
code at the end). At least that is what perluniintro says:

  "Internally, Perl currently uses either whatever the native eight-bit
  character set of the platform (for example Latin-1) is, defaulting to
  UTF-8, to encode Unicode strings. Specifically, if all code points in
  the string are 0xFF or less, Perl uses the native eight-bit character
  set. Otherwise, it uses UTF-8."
  (perldoc perluniintro)

That is how I make sense of it, anyway.

Best regards,

Oliver Block

> Oliver Block wrote:
>
>> Hello everyone,
>>
>> the following code is used to load a web page from a certain web server
>> and parse it into an HTML tree. At the end, a variable is assigned the
>> string representation of that tree.
>>
>> use LWP::UserAgent;
>> use HTML::TreeBuilder;
>>
>> my $ua = LWP::UserAgent->new;
>> my $response = $ua->get($form->{'url'});
>>
>> my $tree = HTML::TreeBuilder->new();
>> $tree->parse($response->content);
>>
>> # ...
>> # encoding of content of $tree is ISO-8859-1 at this point
>> $template = $tree->as_HTML('<>&');
>>
>> # encoding of content of $template is UTF-8
>>
>> Now the following problem arises: the encoding of the content of
>> $template (UTF-8) is not the same as that of the content of $tree
>> (ISO-8859-1). So it is obvious that as_HTML converts the encoding to
>> UTF-8.
>>
>> I debugged everything, and everything is fine up to the last line of
>> code of sub HTML::Element::as_HTML, which is:
>>
>> return join('', @html, "\n");
>>
>> This would mean that join seems to modify the encoding of the content.
>>
>> Any suggestions?
>>
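Regarding the join() question in the original post: the following sketch shows the effect in isolation, without LWP or HTML::TreeBuilder. It uses only core Perl plus Encode. The lists @latin1 and @mixed are made-up stand-ins for the @html pieces inside as_HTML, and the substitution that writes numeric character references is a simplified, hand-rolled stand-in for the entity encoding that as_HTML() performs, not the module's actual code.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-ins for the pieces that as_HTML joins together.
my @latin1 = ("<td>", "\x{bb} Kontakt \x{a0}", "</td>");
my @mixed  = ("<td>", "\x{bb} Kontakt \x{a0}\x{203a} Kontaktformular", "</td>");

my $narrow = join('', @latin1, "\n");   # all code points <= 0xFF
my $wide   = join('', @mixed,  "\n");   # contains U+203A

# utf8::is_utf8 reports whether Perl stores the string internally as UTF-8.
printf "narrow stored as UTF-8: %d\n", utf8::is_utf8($narrow) ? 1 : 0;
printf "wide   stored as UTF-8: %d\n", utf8::is_utf8($wide)   ? 1 : 0;

# Simplified stand-in for entity encoding: replace every character above
# 0xFF with a numeric character reference.
(my $escaped = $mixed[1]) =~ s/([^\x00-\xFF])/sprintf('&#%d;', ord $1)/ge;
my $safe = join('', "<td>", $escaped, "</td>", "\n");

# Every remaining character fits into ISO-8859-1, so encoding is lossless:
# one output byte per character.
my $bytes = encode('ISO-8859-1', $safe);
printf "characters: %d, bytes: %d\n", length($safe), length($bytes);
```

So join() itself does not convert anything; the result simply has to use the UTF-8 representation as soon as one of the pieces contains a character above 0xFF. Once that character is written as an entity, the whole string can be saved as ISO-8859-1 again.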