Wide Character entities in ISO-8859-1 documents (was Re: Correction)

Oliver Block Thu, 15 Oct 2009 11:19:11 -0700

Oliver Block wrote:
> (You will find the perl code at the end)
>
> A close look at the dump of $tree and a comparison with
> $response->content showed the following:
>
> The following markup from $response->content
>
> <td colspan="8" align="left" bgcolor="#FFFFFF" class="Rubrik">&raquo;
> Kontakt &nbsp;&rsaquo; Kontaktformular</td>
>
> appears in tree as
>
> bless( {
> '_parent' =>
> $VAR1->{'_content'}[1]{'_content'}[0]{'_content'}[1]{'_content'}[5],
> '_content' => [
>                          "\x{bb} Kontakt \x{a0}\x{203a} Kontaktformular"
>                        ],
> 'colspan' => '8',
> 'align' => 'left',
> 'bgcolor' => '#FFFFFF',
> '_tag' => 'td',
> 'class' => 'Rubrik'
> }, 'HTML::Element' )
>
> If you have any idea how to avoid the conversion to UTF-8 and how to
> ensure that the output of $tree->as_HTML() can be saved in the same
> encoding as stated in $response, please let me know.
>
>   
I think I've found the cause of the problem. As I mentioned earlier,
the content of a td tag, in my case "&raquo; Kontakt &nbsp;&rsaquo;
Kontaktformular", is represented in the tree by the characters
"\x{bb} Kontakt \x{a0}\x{203a} Kontaktformular". The reason seems to be
that &rsaquo; has no representation as a character in the ISO-8859-1
encoding: its code point is U+203A (&#8250;), which lies above 0xFF.
It is nevertheless legal in an ISO-8859-1-encoded HTML document when it
appears in the form of a character entity reference.
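
To illustrate the point, here is a minimal sketch that asks the core
Encode module which of the three characters from the dump can be written
as an ISO-8859-1 byte. The helper name fits_latin1 is made up for this
example and is not part of any module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The three non-ASCII characters from the dumped <td> content.
my %chars = (
    raquo  => "\x{bb}",
    nbsp   => "\x{a0}",
    rsaquo => "\x{203a}",
);

# Hypothetical helper: returns 1 if $char can be encoded as ISO-8859-1,
# 0 otherwise.
sub fits_latin1 {
    my ($char) = @_;
    # encode() may modify its input when a CHECK value is given,
    # so work on a copy.
    my $copy = $char;
    # FB_CROAK makes encode() die on a character it cannot map.
    return eval { encode('ISO-8859-1', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

for my $name (sort keys %chars) {
    printf "%-6s U+%04X %s\n", $name, ord($chars{$name}),
        fits_latin1($chars{$name})
            ? 'fits in ISO-8859-1'
            : 'has no ISO-8859-1 byte';
}
```

Only rsaquo should be reported as having no ISO-8859-1 byte, which is
why it can survive in such a document only as an entity reference.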


So, changing the parameter for as_HTML from

$tree->as_HTML('<>&');


to

$tree->as_HTML();


solves the problem, because now all "unsafe" characters (e.g. "\x{203a}")
are encoded as entities within as_HTML(). Therefore there is no need for
Perl to encode the complete string as UTF-8 when using join() (see the
code at the end). That is at least how I read perluniintro:

"Internally, Perl currently uses either whatever the native eight-bit
character set of the platform (for example Latin-1) is, defaulting to
UTF-8, to encode Unicode strings. Specifically, if all code points in
the string are 0xFF or less, Perl uses the native eight-bit character
set.  Otherwise, it uses UTF-8." (perldoc perluniintro)
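
The behaviour described in the quoted passage can be observed with a
small sketch using only core Perl. The variable names are chosen for
this example; the strings are the two halves of the td content from the
dump.

```perl
use strict;
use warnings;

my $latin1 = "\x{bb} Kontakt \x{a0}";     # all code points <= 0xFF
my $wide   = "\x{203a} Kontaktformular";  # contains U+203A

# join() has to return one string, and U+203A only fits the UTF-8
# internal form, so the result is stored as UTF-8 as a whole.
my $joined = join('', $latin1, $wide);

# utf8::is_utf8() reports whether Perl stores a string as UTF-8 internally.
for my $pair ([latin1 => $latin1], [wide => $wide], [joined => $joined]) {
    my ($name, $str) = @$pair;
    printf "%-6s stored as UTF-8: %s\n", $name,
        utf8::is_utf8($str) ? 'yes' : 'no';
}

# The number of characters is unchanged by join(); only the byte count
# of the UTF-8 form is larger.
my $bytes = $joined;
utf8::encode($bytes);   # turn the characters into UTF-8 octets
printf "characters: %d, UTF-8 bytes: %d\n", length($joined), length($bytes);
```

So join() does not change which characters are in the string. It only
changes how they are stored, and that becomes visible once the string is
written out without an explicit encoding.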

That's at least how I make sense of it.
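
For comparison, the two as_HTML() calls can be tried on a single
hand-built element instead of a parsed page. This is only a sketch: it
assumes HTML::Element (from the HTML-Tree distribution) is installed,
and the td element here is a stand-in for the one in the real tree.

```perl
use strict;
use warnings;
use HTML::Element;

# Stand-in for the <td> node from the dump, built by hand.
my $td = HTML::Element->new('td', class => 'Rubrik');
$td->push_content("\x{bb} Kontakt \x{a0}\x{203a} Kontaktformular");

my $raw  = $td->as_HTML('<>&');  # only <, > and & are turned into entities
my $safe = $td->as_HTML();       # all unsafe characters are turned into entities

print "as_HTML('<>&') still contains U+203A\n" if $raw  =~ /\x{203a}/;
print "as_HTML() output is pure ASCII\n"       if $safe !~ /[^\x00-\x7F]/;
print $safe;
```

Since the second form contains only ASCII, it can be saved as
ISO-8859-1 without any character being lost.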

Best regards,

Oliver Block



> Oliver Block wrote:
>   
>> Hello everyone,
>>
>> the following code is used to load a web page from a certain web server
>> and parse it into an html tree. At the end a variable is assigned the
>> string representation of that tree.
>>
>>         use LWP::UserAgent;
>>         use HTML::TreeBuilder;
>>
>>         my $ua = LWP::UserAgent->new;
>>         my $response = $ua->get($form->{'url'});
>>
>>         my $tree = HTML::TreeBuilder->new();
>>         $tree->parse($response->content);
>>
>> # ...
>> # encoding of content of $tree is ISO-8859-1 at this point
>>         $template = $tree->as_HTML('<>&');
>>
>> # encoding of content of $template is UTF-8
>>
>> Now the following problem arises. The encoding of the content of
>> $template (UTF-8) is not the same as that of the content of $tree
>> (ISO-8859-1). So it is obvious that as_HTML converts the encoding to UTF-8.
>>
>> I debugged everything, and everything is fine up to the last line of code of
>> sub HTML::Element::as_HTML, which is:
>>
>>   return join('', @html, "\n");
>>
>> This would mean that join seems to modify the encoding of the content.
>>
>> Any suggestions?
>>
