> If the pages you're working on are well-formed HTML, you may be troubled by
> a more severe problem: HTML::Parser and HTML::TreeBuilder are expected
> to leave non-broken HTML exactly the way it is, but they don't always
> do so.  There are problems with handling framesets; perhaps there are
> other problems.  If you find any, they should really be fixed.
...
> Can you post a *minimal* HTML fragment that exhibits the problem?

I finally found out what modification the original pages need in order
to be processed properly by HTML::Parser: a closing </tr> followed by a
new <table> should be treated as </tr></table><table>.
The problem is that every browser I tried (except lynx) renders the
following HTML code as if the tags that are commented out were
present:

HTML-Source:
<html><head> <title> Test tables </title> </head><body>
<table>
<tr>
<td> 
<table>
<tr><td>1.1</td></tr>
<!-- </table> -->
<table>
<tr><td>2.1</td></tr>
</table>
</td>
<td>
<table>
<tr><td>1.2</td></tr>
<!-- </table> -->
<table>
<tr><td>2.2</td></tr>
</table>
</td>
</tr>
</table>
</body>
</html>

However, reading and writing this HTML with the following script gives
different output, which the browsers also render differently.

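#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder 3;

# Usage: script.pl FILE  (writes the re-serialized HTML to nFILE)
my $file = shift or die "usage: $0 FILE\n";

my $tree = HTML::TreeBuilder->new();
$tree->no_space_compacting(1);           # do not collapse runs of whitespace
$tree->ignore_ignorable_whitespace(0);   # keep whitespace-only text nodes
$tree->store_comments(1);                # keep comments in the tree
$tree->parse_file($file) or die "Cannot parse $file: $!\n";

open(my $out, '>', "n$file") or die "Cannot write n$file: $!\n";
print $out $tree->as_HTML;
close($out) or die "Cannot close n$file: $!\n";
$tree->delete();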

HTML-Output:
<html><head> <title> Test tables </title> </head><body>
<table>
<tr>
<td> 
<table>
<tr><td>1.1</td></tr>
<!-- </table> -->
<tr><td><table>
<tr><td>2.1</td></tr>
</table>
</td>
<td>
<table>
<tr><td>1.2</td></tr>
<!-- </table> -->
<tr><td><table>
<tr><td>2.2</td></tr>
</table>
</td>
</tr>
</table>
</td></tr></table></td></tr></table></body>

</html>

This means that if a row is closed (</tr>) inside an open table and a
new table follows, HTML::Parser treats the new table as a new data cell
and inserts <tr><td>. Most browsers (in fact every browser I tried
except lynx), however, interpret this as the end of the open table.
I think this should be changed, or at least there should be a switch to
enable this behaviour, although of course this is not proper HTML.
I will now take a look at start() to see how this could be done.
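
Until the parser itself handles this, the rule described above (</tr>
followed by <table> means </tr></table><table>) could be applied as a
preprocessing step on the raw HTML. The following is only a rough sketch
under that assumption: close_tables_before_new_table is a hypothetical
helper name, not part of HTML::Parser or HTML::TreeBuilder, and the
regular expression only looks at whitespace and comments between the two
tags, so it is no substitute for a fix in the parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Parser or HTML::TreeBuilder):
# insert </table> wherever a closing </tr> is followed by a new <table>
# with nothing but whitespace and/or comments in between.
sub close_tables_before_new_table {
    my ($html) = @_;
    $html =~ s{
        (</tr\s*>)                                # a closing row tag
        ((?: \s | <!-- (?: (?!-->) . )* --> )*)   # whitespace and/or whole comments
        (?=<table\b)                              # directly before a new table start tag
    }{$1</table>$2}gisx;
    return $html;
}

# Small demonstration on a fragment of the test page.
my $fragment = "<table>\n<tr><td>1.1</td></tr>\n<!-- </table> -->\n"
             . "<table>\n<tr><td>2.1</td></tr>\n</table>\n";
print close_tables_before_new_table($fragment);
```

The returned string could then be handed to the tree builder with
$tree->parse($fixed) followed by $tree->eof instead of calling
parse_file on the original file. A </tr> that is already followed by
</table> is left untouched, because the pattern only matches when the
next tag is a <table> start tag.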

Neven Luetic

