Re: New HTML::TokeParser Interface

Curtis Poe Mon, 04 Feb 2002 09:55:48 -0800

----- Original Message -----
From: "Gisle Aas" <[EMAIL PROTECTED]>
> > Want to print all comments in an HTML doc?
> >
> >      my $p = HTML::TokeParser::Easy->new( $doc );
> >      while ( my $token = $p->get_token )
> >      {
> >          next if ! $p->is_comment( $token );
> >          print PHB $p->return_text( $token ), "\n";
> >      }
>
> How is actually 'return_text' here different from the old 'get_text'
> that was already provided?


In HTML::TokeParser, you have the following attributes for token types:

     ["S",  $tag, $attr, $attrseq, $text]
     ["E",  $tag, $text]
     ["T",  $text, $is_data]
     ["C",  $text]
     ["D",  $text]
     ["PI", $token0, $text]

The third entry in the list above, "T", is the "text" returned by
"get_text".  In other words, IIRC, this is the text that is visible on the
Web page.  However, every token type has a "$text" element, which is the
exact source text of the returned token.  In Easy.pm, what I did was take
the above information, stuff it into a hash with a bit of identifying
information and add an AUTOLOAD sub that generates the appropriate methods
on the fly.  Thus, $text is what "return_text()" returns.  I would have
preferred names like get_attr() and get_text(), but the latter would have
overridden the original get_text() method :(

To keep things clear, I used the exact names from the above list.  For
example, here's one key in the hash:

 S => {
  _name   => 'START_TAG',
  tag     => 1,
  attr    => 2,
  attrseq => 3,
  text    => 4
 }
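
A minimal sketch of how such a hash could drive AUTOLOAD, using core Perl
only.  This is not the code from Easy.pm: the package name My::TokenSketch,
every "_name" value other than START_TAG, and the hand-built sample token
are assumptions made for illustration.  The indexes follow the token layouts
listed earlier.  Unlike a version that installs methods, this one simply
dispatches each return_<field>() call through AUTOLOAD.

```perl
package My::TokenSketch;    # hypothetical name, not the real Easy.pm
use strict;
use warnings;

our $AUTOLOAD;

# Token type => { field name => index into the token array }.
# Only the S entry comes from the post; the rest are assumed by analogy.
my %token_spec = (
    S  => { _name => 'START_TAG', tag => 1, attr => 2, attrseq => 3, text => 4 },
    E  => { _name => 'END_TAG',   tag => 1, text => 2 },
    T  => { _name => 'TEXT',      text => 1, is_data => 2 },
    C  => { _name => 'COMMENT',   text => 1 },
    D  => { _name => 'DECLARATION', text => 1 },
    PI => { _name => 'PROCESS_INSTRUCTION', token0 => 1, text => 2 },
);

sub new { return bless {}, shift }

# True if the token is a comment token.
sub is_comment {
    my ( $self, $token ) = @_;
    return $token->[0] eq 'C';
}

# Handle any call of the form $obj->return_<field>( $token ).
sub AUTOLOAD {
    my ( $self, $token ) = @_;
    my ($method) = $AUTOLOAD =~ /([^:]+)\z/;
    return if $method eq 'DESTROY';    # never treat object cleanup as a field

    my ($field) = $method =~ /\Areturn_([a-z0-9]+)\z/
        or die "No such method: $method";
    my $spec = $token_spec{ $token->[0] }
        or die "Unknown token type: $token->[0]";

    # undef when this token type has no such field (e.g. attr on an end tag)
    my $index = $spec->{$field};
    return defined $index ? $token->[$index] : undef;
}

package main;

my $p     = My::TokenSketch->new;
my $start = [ 'S', 'a', { href => '/' }, ['href'], '<a href="/">' ];

print $p->return_tag($start),  "\n";
print $p->return_text($start), "\n";
```

Keeping the indexes in one table means the array layout is written down
once, and callers only ever see field names.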

> >         next if ! $p->is_comment( $token );
> >         print PHB $p->return_text( $token );

Part of the reason I like this interface is that, without it, the above
two lines were originally:

         next if $token->[ 0 ] ne 'C';
         print PHB $token->[ 1 ];

Since I am a huge fan of trying to make "intuitive" interfaces, I just
didn't care to try and remember what all of the array elements were.
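
To make the comparison concrete, here is a small sketch that selects comment
tokens both ways from a hand-built token list.  The sample tokens and the
plain subs is_comment() and comment_text() are stand-ins invented for this
example; they are not the Easy.pm methods, and no parser is involved.

```perl
use strict;
use warnings;

# Hand-built sample tokens in the array layout listed earlier.
my @tokens = (
    [ 'S', 'p', {}, [], '<p>' ],
    [ 'C', '<!-- first -->' ],
    [ 'T', 'hello', '' ],
    [ 'C', '<!-- second -->' ],
    [ 'E', 'p', '</p>' ],
);

# Hypothetical named helpers that hide the array positions.
sub is_comment   { my ($token) = @_; return $token->[0] eq 'C' }
sub comment_text { my ($token) = @_; return $token->[1] }

# Positional version: the reader has to know what [0] and [1] mean.
my @by_index = map { $_->[1] } grep { $_->[0] eq 'C' } @tokens;

# Named version: same result, but the intent is spelled out.
my @by_name = map { comment_text($_) } grep { is_comment($_) } @tokens;

print "$_\n" for @by_name;
```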

> I think blessing of the tokens might have merit.  I also think that
> HTML::TokeParser (and HTML::PullParser) should have some kind of
> support for this.

[snip]

Well, rather than traipsing too far down this road, perhaps just offering up
the module for inspection is better.  The distribution is at
http://www.easystreet.com/~ovid/cgi_course/downloads/HTML-TokeParser-Easy-1.0.tar.gz

I haven't added any tests, as I wasn't sure whether I would be wasting my
time.  However, there is complete POD, so understanding what I did and
why should be fairly clear.  It also has some sample programs in the POD.  I
don't think that Easy.pm is appropriate for all TokeParser programs, but it
really makes things clearer for those in which it is a good fit -- ugh, was
that an awkward sentence, or what? :)

--
Curtis "Ovid" Poe, Senior Programmer, ONSITE! Technology
Someone asked me how to count to 10 in Perl:
push @A, $_ for reverse q.e...q.n.;for(@A){$_=unpack(q|c|,$_);@a=split//;
shift @a;shift @a if $a[$[]eq$[;$_=join q||,@a};print $_,$/for reverse @A
