On Friday, April 5, 2002, at 10:43, Paul Tremblay wrote:
[..]
> The problem is that the filter deletes all of my text and outputs this:
>
> [TABLE NOT SHOWN][TABLE NOT SHOWN][TABLE NOT SHOWN][TABLE NOT
> SHOWN][TABLE NOT SHOWN]
Right! That is the big clue I should have seen: there is no
'plain html stuff', it's all stuffed in tables.
I just ran the code against a web page that is all one big
form, with some table foo on the inside, and got the
equivalent of your response.
>
> I have tried it on five different files. All of these files were
> from the same website. It appears that this module is broken.
> That is, it can't handle certain html (which is valid when looked
> at in a browser).
I just smelled the coffee: all of the 'information' that you
are looking for is presented in tables, and in essence
these classes of web pages are little more than
<HTML><HEAD><TITLE>SomeBuzzHere</TITLE></HEAD>
<BODY BGCOLOR=#ffffff>
Table, table table.....
.....
maybe not even closed with
</BODY></HTML>
so what you want to do is spin up some code along the lines of:
    # $tree is the HTML::TreeBuilder object built in the script further down.
    my $page = '';
    my @tables = $tree->look_down( "_tag", "table" );
    foreach my $tab (@tables) {
        # Header cells first, one per line.
        my @Th_list = $tab->look_down( "_tag", "th" );
        foreach my $t (@Th_list) {
            next unless ($t);
            foreach my $item_r ( $t->content_refs_list ) {
                next if ref $$item_r;    # skip child elements, keep plain text
                $page .= "$$item_r \n";
            }
        }
        # Then the data cells, one output line per table row.
        my @Tr_list = $tab->look_down( "_tag", "tr" );
        foreach my $tr (@Tr_list) {
            my @td_list = $tr->look_down( "_tag", "td" );
            foreach my $t (@td_list) {
                foreach my $item_r ( $t->content_refs_list ) {
                    next if ref $$item_r;
                    $page .= "$$item_r ";
                }
            }
            $page .= "\n" if (@td_list);
        }
        $page .= "#---------\n";    # separator between tables
    }
    print $page;
so that you wind up sucking the details out of the table elements
themselves.
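For what it's worth, here is a shorter, self-contained sketch of the same idea. It is only an illustration under a few assumptions: the helper name table_rows and the inline sample page are made up for this example, and it uses as_text where the loop above walks content_refs_list.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not from the original post): returns one array ref
# per table row, each holding the text of that row's th/td cells.
sub table_rows {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @rows;
    foreach my $tr ( $tree->look_down( _tag => 'tr' ) ) {
        # Keep only the cells that are direct children of this row, so the
        # rows of a nested table are not repeated as separate cells.
        my @cells = grep { ref $_ && $_->tag =~ /^t[dh]$/ } $tr->content_list;
        push @rows, [ map { $_->as_text } @cells ];
    }
    $tree->delete;    # release the parse tree
    return \@rows;
}

# Made-up sample page that is nothing but a table.
my $html = <<'END_HTML';
<html><body><table>
<tr><th>Name</th><th>Price</th></tr>
<tr><td>Widget</td><td><b>4.50</b></td></tr>
</table></body></html>
END_HTML

print join( ' | ', @$_ ), "\n" for @{ table_rows($html) };
```

The one behavioural difference worth knowing: as_text also collects text that sits inside nested markup such as <b>, whereas skipping every ref in content_refs_list drops it. Which of the two you want depends on how the pages you are scraping are marked up.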
the problem is not really with:
    use strict;
    use warnings;
    use HTML::Parser;
    use HTML::FormatText;
    use HTML::TreeBuilder;

    my $filename = $ARGV[0];
    open( my $fh, '<', $filename )
        or die "unable to open file $filename: $!\n";
    my $html_text = do { local $/; <$fh> };    # slurp the whole file
    close($fh);

    ###my $plain_text = HTML::FormatText->new->format(parse_html($html_text));
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html_text);
    $tree->eof;    # tell the parser the document is complete
    my $plain_text = HTML::FormatText->new->format($tree);
    print "$plain_text\n";
    #----
save that it can only do what it does: HTML::FormatText does not
render tables, it prints [TABLE NOT SHOWN] in place of each one.
ciao
drieux