Gary Nielson wrote:
>
> I can get by programming in Perl, but my head hurts trying to
> understand how object-oriented modules such as TokeParser work.
> Basically, I want to parse an html file where each entry looks like
> this:
>
> <DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
> <A HREF="/rc/news/docs/07073706.htm">Junk DNA may not be such junk,
> genome studies find</A>
> </B></FONT></DT>
> <DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>WASHINGTON -
> — The first in-depth look into the
> human genome shows it is much more complicated.. <P>
> </FONT></DD>
>
Here is one way to do it. Assume you have the following file
called 'sample.html'.
<html>
<head><title>Tutorial</title></head>
<body>
<dl>
<DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
<A HREF="/rc/news/docs/07073706.htm">Junk DNA may not be such junk,
genome studies find</A>
</B></FONT></DT>
<DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>WASHINGTON -
— The first in-depth look into the
human genome shows it is much more complicated.. <P>
</FONT></DD>
<DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
<A HREF="http://search.cpan.org/doc/GAAS/HTML-Parser-3.15/lib/HTML/TokeParse
The HTML::TokeParser is an
alternative interface to the HTML::Parser class.
</A>
</B></FONT></DT>
<DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>Sebastopol -
It basically turns the HTML::Parser inside out.
You associate a file (or any IO::Handle object or
string) with the parser at construction
time and then repeatedly call $parser->get_token
to obtain the tags and text found in the parsed
document.
<P>
</FONT></DD>
<DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
<A HREF="http://search.cpan.org/doc/GAAS/HTML-Parser-3.15/Parser.pm">
This is the new XS based HTML::Parser</A>
</B></FONT></DT>
<DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>Boston -
Objects of the HTML::Parser class will recognize
markup and separate it from plain text (alias data
content) in HTML documents. As different kinds of
markup and text are recognized, the corresponding
event handlers are invoked.
<p>
</FONT></DD>
<DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
<A HREF="http://search.cpan.org/doc/GAAS/libwww-perl-5.10/lib/HTML/TreeBuild
This is a parser that builds (and actually itself is) a HTML syntax tree.
</A>
</B></FONT></DT>
<DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>PITTSBURG -
Objects of this class inherit the methods of both
HTML::Parser and HTML::Element. After parsing has
taken place it can be regarded as the syntax tree
itself.
<P>
</FONT></DD>
</dl>
</body>
</html>
Run the following code.
use strict;
use HTML::TokeParser;
require 'dumpvar.pl';    # only needed for the commented-out dumpValue() below

my $p = HTML::TokeParser->new("sample.html")
    or die "Can't open sample.html: $!";
my $rss = [];            # array ref of records, one hash ref per entry

while (my $token = $p->get_token) {
    # Skip everything until the next <dt> start tag.
    next unless $token->[0] eq 'S' and
                $token->[1] eq 'dt';
    my $rec = {};
    while (my $token = $p->get_token) {
        # The closing </dd> ends the current record.
        last if $token->[0] eq 'E' and
                $token->[1] eq 'dd';
        if ($token->[0] eq 'S' and
            $token->[1] eq 'a') {
            # A start tag keeps its attributes in the hash ref at index 2.
            $rec->{url}      = $token->[2]{href};
            $rec->{headline} = $p->get_trimmed_text('/a');
        } elsif ($token->[0] eq 'S' and
                 $token->[1] eq 'dd') {
            $rec->{summary} = $p->get_trimmed_text('/dd');
        }
    }
    push(@$rss, $rec);
}
#dumpValue(\$rss);
for my $rec (@$rss) {
    print join('||', $rec->{url}, $rec->{headline}, $rec->{summary}), "\n\n";
}
__END__
HTML::TokeParser parses an HTML document and hands you its
contents as a sequence of tokens. You pull tokens off that
sequence through the methods of the class, such as get_token.
Each token is a reference to an array whose first element is the
token type, for example 'S' for a start tag, 'E' for an end tag
and 'T' for text.
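To make the token layout concrete, here is a minimal sketch that walks a one-line document and records what each token looks like. The input string and the @seen array are made up for illustration; the parser is given a reference to the string instead of a file name.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical one-line document, passed by reference so no file is needed.
my $html = '<dt><a href="/x.htm">Headline</a></dt>';
my $p = HTML::TokeParser->new(\$html) or die "Can't create parser";

my @seen;    # one short description per token, in document order
while (my $token = $p->get_token) {
    my $type = $token->[0];
    if ($type eq 'S') {
        # Start tag: [ 'S', $tag, \%attr, \@attrseq, $original_text ]
        my ($tag, $attr, $attrseq) = @$token[1 .. 3];
        push @seen, join ' ', 'S', $tag, map { "$_=$attr->{$_}" } @$attrseq;
    } elsif ($type eq 'E') {
        # End tag: [ 'E', $tag, $original_text ]
        push @seen, "E $token->[1]";
    } elsif ($type eq 'T') {
        # Text: [ 'T', $text, $is_data ]
        push @seen, "T $token->[1]";
    }
}
print "$_\n" for @seen;
```

For this input the loop should print five lines: "S dt", "S a href=/x.htm", "T Headline", "E a" and "E dt". Note that tag and attribute names come back in lower case whatever their case in the document, which is why the script compares against 'dt' and not 'DT'.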
The above program reads the tokens and begins the winnowing
process. The outer while loop skips every token until it finds
an 'S' (start) tag named 'dt'. Once a <dt> tag is found, we
create a hash ref that will hold the data for that record. The
inner loop jumps out when it sees the closing </dd>. If we see
the starting <a> tag, we grab the url: the third element of the
token is a hash ref of attributes, and we want the value whose
key is 'href'. Then we grab all the text up to the closing </a>
tag. If we see the <dd> tag, we grab all the text up to the
closing </dd> tag. When we jump out, we push $rec onto an array
and go back for more.
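
The same walk can also be written with get_tag, which skips ahead to the next tag of a given name so you do not have to test each token yourself. This is only a sketch of that alternative, not a replacement for the program above; the two-entry document and the @records array are invented for the example, and it assumes every <a> is followed by a <dd>.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical two-entry document in the same shape as sample.html.
my $html = <<'HTML';
<dl>
<dt><b><a href="/one.htm">First
headline</a></b></dt>
<dd>CITY - First summary. <p></dd>
<dt><a href="/two.htm">Second headline</a></dt>
<dd>TOWN - Second summary.</dd>
</dl>
HTML

my $p = HTML::TokeParser->new(\$html) or die "Can't create parser";

my @records;
# get_tag('a') skips tokens until the next <a> start tag and returns
# [ $tag, \%attr, \@attrseq, $original_text ], or undef at end of input.
while (my $tag = $p->get_tag('a')) {
    my %rec = (url => $tag->[1]{href});
    $rec{headline} = $p->get_trimmed_text('/a');

    # Move on to the matching <dd>; stop if the document ends first.
    $p->get_tag('dd') or last;
    $rec{summary} = $p->get_trimmed_text('/dd');

    push @records, \%rec;
}

print join('||', @$_{qw(url headline summary)}), "\n" for @records;
```

The trade-off is that get_tag keys on <a> alone, so a stray link outside a <dt> would start a bogus record, whereas the get_token version only looks inside <dt>...</dd> pairs.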