On Friday, April 5, 2002, at 10:43, Paul Tremblay wrote:

[..]
> The problem is that the filter deletes all of my text and outputs this:
>
> [TABLE NOT SHOWN][TABLE NOT SHOWN][TABLE NOT SHOWN][TABLE NOT
> SHOWN][TABLE NOT SHOWN]

Right! That is the big clue I should have seen - there is no 'plain
html stuff' - it's all stuffed in tables....

I just ran the code against a webPage that is all one big form - with
some table foo on the inside.... and got the equivalent of your response...

> I have tried it on five different files. All of these files were
> from the same website. It appears that this module is broken.
> That is, it can't handle certain html (which is valid when looked
> at in a browser).

I just smelled the coffee - all of the 'information' that you are
looking for is being presented in tables - and in essence these
classes of webPages are little more than

    <HTML><HEAD><TITLE>SomeBuzzHere</TITLE></HEAD>
    <BODY BGCOLOR=#ffffff>
    Table, table table.....
    .....

maybe not even closed with </BODY></HTML>

so what you want to do is spin up some code along the lines of:

    my $page = '';
    # every <table> anywhere in the parsed tree
    my @tables = $tree->look_down( "_tag", "table" );
    foreach my $tab (@tables) {
        # header cells first, one per line
        my @Th_list = $tab->look_down( "_tag", "th" );
        foreach my $t (@Th_list) {
            next unless ($t);
            foreach my $item_r ( $t->content_refs_list ) {
                next if ref $$item_r;    # skip child elements, keep plain text
                $page .= "$$item_r \n";
            }
        }
        # then the data cells, one output line per row
        my @Tr_list = $tab->look_down( "_tag", "tr" );
        foreach my $tr (@Tr_list) {
            my @td_list = $tr->look_down( "_tag", "td" );
            foreach my $t (@td_list) {
                foreach my $item_r ( $t->content_refs_list ) {
                    next if ref $$item_r;
                    $page .= "$$item_r ";
                }
            }
            $page .= "\n" if (@td_list);
        }
        $page .= "#---------\n";
    }
    print $page;

so that you wind up sucking out the details from the table elements
themselves .....

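One thing to watch with the loop above: it skips anything inside a cell that is itself an element, so text wrapped in <b> or <a> never makes it into $page. A minimal sketch of a variation, assuming you are happy to flatten each cell with as_text (a stock HTML::Element method) and to build the tree with new_from_content (a stock HTML::TreeBuilder constructor). The sample HTML and the helper name table_text are made up for illustration, not taken from the original files:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample page, standing in for the real files.
my $html = <<'HTML';
<html><body>
<table>
<tr><th>Name</th><th>Price</th></tr>
<tr><td><b>Widget</b></td><td>4.50</td></tr>
</table>
</body></html>
HTML

# Hypothetical helper: one output line per table row,
# cells separated by a single space.
sub table_text {
    my ($html) = @_;

    # new_from_content parses the string and finishes the tree for us
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $page = '';

    foreach my $table ( $tree->look_down( _tag => 'table' ) ) {
        foreach my $row ( $table->look_down( _tag => 'tr' ) ) {
            # pick up header and data cells in document order
            my @cells = $row->look_down( _tag => qr/^t[dh]$/ );
            next unless @cells;

            # as_text flattens nested markup such as <b> or <a>
            $page .= join( ' ', map { $_->as_text } @cells ) . "\n";
        }
        $page .= "#---------\n";
    }

    $tree->delete;    # free the tree when done
    return $page;
}

print table_text($html);
```

Like the original loop, this will visit the rows of a nested table twice (once from the outer table, once from the inner one), so pages that nest tables heavily may want an extra check.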
the problem is not really with:

    use HTML::Parser;
    use HTML::FormatText;
    use HTML::TreeBuilder;

    my $html_text;
    my $filename = $ARGV[0];
    open(FH, $filename) or die "unable to open file $filename :$!\n";
    while (<FH>) {
        $html_text .= $_ ;
    }
    close(FH);

    ###my $plain_text = HTML::FormatText->new->format(parse_html($html_text));
    my $tree = HTML::TreeBuilder->new->parse($html_text);
    $tree->eof;    # tell the parser the document is complete
    my $plain_text = HTML::FormatText->new->format($tree);
    print "$plain_text\n";
    #----

save that it can only do what it does - and HTML::FormatText does not
format tables, hence the [TABLE NOT SHOWN].

ciao
drieux

---