Gary Nielson wrote:
> 
> I can get by programming in Perl, but my head hurts trying to
> understand how object-oriented modules such as TokeParser work.
> Basically, I want to parse an html file where each entry looks like
> this:
> 
> <DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
> <A HREF="/rc/news/docs/07073706.htm">Junk DNA may not be such junk,
> genome studies find</A>
> </B></FONT></DT>
> <DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>WASHINGTON -
> &#0151; The first in-depth look into the
> human genome shows it is much more complicated.. <P>
> </FONT></DD>
> 

Here is one way to do it. Assume you have the following file
called 'sample.html'.

<html>
<head><title>Tutorial</title></head>
<body>

<dl>
    <DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
    <A HREF="/rc/news/docs/07073706.htm">Junk DNA may not be such junk,
    genome studies find</A>
    </B></FONT></DT>
    <DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>WASHINGTON -
    &#0151; The first in-depth look into the
    human genome shows it is much more complicated.. <P>
    </FONT></DD>

    <DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
    <A HREF="http://search.cpan.org/doc/GAAS/HTML-Parser-3.15/lib/HTML/TokeParse
    The HTML::TokeParser is an
    alternative interface to the HTML::Parser class.
    </A>
    </B></FONT></DT>
    <DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>Sebastopol -
    It basically turns the HTML::Parser inside out.
    You associate a file (or any IO::Handle object or
    string) with the parser at construction
    time and then repeatedly call $parser->get_token
    to obtain the tags and text found in the parsed
    document.
    <P>
    </FONT></DD>

    <DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
    <A HREF="http://search.cpan.org/doc/GAAS/HTML-Parser-3.15/Parser.pm">
    This is the new XS based HTML::Parser</A>
    </B></FONT></DT>
    <DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>Boston -
    Objects of the HTML::Parser class will recognize
    markup and separate it from plain text (alias data
    content) in HTML documents. As different kinds of
    markup and text are recognized, the corresponding
    event handlers are invoked.
    <p>
    </FONT></DD>
    <DT><FONT FACE="Arial, Helvetica, sans-serif"><B>
    <A HREF="http://search.cpan.org/doc/GAAS/libwww-perl-5.10/lib/HTML/TreeBuild
    This is a parser that builds (and actually itself is) a HTML syntax tree.
    </A>
    </B></FONT></DT>
    <DD><FONT FACE="Arial, Helvetica, sans-serif" SIZE=2>PITTSBURG -
    Objects of this class inherit the methods of both
    HTML::Parser and HTML::Element. After parsing has
    taken place it can be regarded as the syntax tree
    itself.
     <P>
    </FONT></DD>

</dl>
</body>
</html>

Run the following code.

use strict;
use HTML::TokeParser;
require 'dumpvar.pl';

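# new() returns undef if the file cannot be opened, so fail early.
my $p = HTML::TokeParser->new("sample.html")
    or die "Can't open sample.html: $!";
my $rss = [];   # array ref that collects one hash ref per entry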

while(my $token = $p->get_token) {
    next unless $token->[0] eq 'S' and
        $token->[1] eq 'dt';
    my $rec = {};
    while(my $token = $p->get_token) {
        last if $token->[0] eq 'E' and
            $token->[1] eq 'dd';
        if($token->[0] eq 'S' and
                $token->[1] eq 'a') {
            $rec->{url} = $token->[2]{href};
            $rec->{headline} = $p->get_trimmed_text('/a');
        } elsif($token->[0] eq 'S' and
                $token->[1] eq 'dd') {
            $rec->{summary} = $p->get_trimmed_text('/dd');
        }
    }
    push(@$rss,$rec);
}
#dumpValue(\$rss);

for my $rec (@$rss) {
    print join('||',$rec->{url},$rec->{headline},$rec->{summary}),"\n\n";
}

__END__
HTML::TokeParser parses an HTML document and hands it to you as
a stream of tokens, one at a time. You read that stream through
the methods of the class, such as get_token. Each token is a
reference to an array whose first element is the token type:
'S' for a start tag, 'E' for an end tag, 'T' for text.
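
To make the token layout concrete, here is a minimal sketch that
walks a one-line document and prints each token. The document
string is made up for illustration; passing a reference to a
string instead of a file name makes the parser read from memory.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up one-line document, parsed from memory via a string ref.
my $html = '<a href="/x.htm">Hi</a>';
my $p = HTML::TokeParser->new(\$html)
    or die "Can't create parser";

my @kinds;   # token types in the order they arrive
while (my $token = $p->get_token) {
    my $type = $token->[0];
    push @kinds, $type;
    if ($type eq 'S') {
        # Start token: ['S', $tag, \%attr, \@attrseq, $text]
        print "start $token->[1], href is $token->[2]{href}\n";
    } elsif ($type eq 'E') {
        # End token: ['E', $tag, $text]
        print "end $token->[1]\n";
    } elsif ($type eq 'T') {
        # Text token: ['T', $text, $is_data]
        print "text '$token->[1]'\n";
    }
}
```

Tag names arrive lower-cased, which is why the program above
compares against 'dt', 'dd' and 'a' even though the file uses
upper-case tags.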

The above program parses the document and begins the winnowing
process. The outer while loop skips every token that is not an
'S' (start) token for a 'dt' tag. Once a <dt> tag is found we
create a hash ref that will hold the data for that record, and
the inner while loop reads the tokens that follow. If we see the
starting <a> tag, we grab the url: the third element of the
token is a hash ref of attributes and we want the value whose
key is 'href'. Then we grab all the text up to the closing </a>
tag. If we see the <dd> tag, we grab all the text up to the
closing </dd> tag. get_trimmed_text stops just before that end
tag, so the next get_token returns the </dd> itself, and that is
where we jump out of the inner loop. When we jump out, we push
$rec onto the array and go back for more.
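
As an alternative sketch (not the program above), the same
records can be pulled out with get_tag, which skips ahead to the
next start tag of a given name so you do not have to test each
token yourself. The one-entry document here is made up and
modelled on sample.html. This version assumes every <a> in the
document is a headline, so it does not look for <dt> first.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up one-entry document, modelled on sample.html.
my $html = <<'HTML';
<dl>
<dt><b><a href="/docs/1.htm">First
headline</a></b></dt>
<dd><font size=2>Summary text. <p></font></dd>
</dl>
HTML

my $p = HTML::TokeParser->new(\$html)
    or die "Can't create parser";

my @records;
# get_tag('a') returns [$tag, \%attr, \@attrseq, $text], which is
# a start token without the leading 'S', or undef at end of input.
while (my $anchor = $p->get_tag('a')) {
    my %rec = (url => $anchor->[1]{href});
    $rec{headline} = $p->get_trimmed_text('/a');
    # Skip ahead to the following <dd>, if there is one.
    if ($p->get_tag('dd')) {
        $rec{summary} = $p->get_trimmed_text('/dd');
    }
    push @records, \%rec;
}

for my $rec (@records) {
    print join('||', @{$rec}{qw(url headline summary)}), "\n";
}
```

The trade-off is that get_tag silently discards everything it
skips, so the explicit get_token loop is the better fit when you
need to react to other tags along the way.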