HTML queries (was: HTML::Element screen-scraping)

Reinier Post Wed, 31 May 2000 03:30:17 -0700
> On a broader topic, I've been thinking of extending the
> find_by_attribute method to something more general, such that one
> could write things like:
> 
>   my @matching = $h->magic_scanner(
>     '_tag' => 'p',
>     'class' => 'restInfo',
>     sub { scalar( $_->find_by_attribute('class', 'restName') ) },
>     qr/mm-mm-good!/,
>      # maybe that should mean the same as:
>      #  sub { $_->as_text() =~ m/mm-mm-good!/; }
>   );


This is an attempt to abstract away from the details of traversal,
but it can be taken one step further: implement a query language!

I.e. what you really want to write is something like

  my @matching = $h->find_by_query(

    '<P CLASS="restInfo"> containing <SPAN CLASS="restName">
                                containing /mm-mm-good!/'

  );

I actually have a specification for this language on paper, done as a
mental exercise.  My intention is to implement parts of it on top of
HTML::Element, but it may never be finished.  Other HTML query
languages published in the literature could be used instead.
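
Until such a language exists, the query above can be approximated by
hand.  This is only a sketch, and it assumes an HTML::Element release
that provides look_down(); the sample HTML is made up, and I read
"containing" as nesting, i.e. the SPAN's own text must match the
pattern:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample data, for illustration only.
my $html = <<'END_HTML';
<html><body>
<p class="restInfo"><span class="restName">Joe's: mm-mm-good!</span> Thai</p>
<p class="restInfo"><span class="restName">Bland Cafe</span> Thai</p>
<p class="other"><span class="restName">Elsewhere: mm-mm-good!</span></p>
</body></html>
END_HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# <P CLASS="restInfo"> containing <SPAN CLASS="restName">
#                                 containing /mm-mm-good!/
my @matching = $tree->look_down(
    _tag  => 'p',
    class => 'restInfo',
    sub {
        my $p = shift;    # look_down passes the candidate element as an argument
        # True if some SPAN CLASS="restName" under this P has matching text.
        scalar $p->look_down(
            _tag  => 'span',
            class => 'restName',
            sub { $_[0]->as_text =~ /mm-mm-good!/ },
        );
    },
);

print $_->as_text, "\n" for @matching;
```

Each "containing" becomes one more nested call, which is exactly the
bookkeeping a query language would hide.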

> Or maybe that's the /first/ thing that needs doing -- while traverse()
> is very general, maybe what most people /mean/ by using it most of the
> time could be done more intuitively with something more like the
> above.
>
> (Alternately: "Of course, at some point this just turns into
> find(1)...", with -prune and -o and whatnot.)

What I have in mind is closer to XSLT/XPath.

> As to your crypto-code like:
> 
> >     $p->content =~ m{<B>Cuisine:</B> (.*) <BR>};
> >     $rest{cuisine} = $1;
> 
> ...this can be expressed in terms of tree structure as: find 'b'
> elements with one text node (consisting of 'Cuisine:') as a child,
> and then look at each one's right sister node, which should be text...

Yes, but it would be nice to use an HTML-like notation,
while keeping the benefits of real HTML parsing.
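
For comparison, the tree-structure version Sean describes might look
roughly like this.  It is a sketch only: the sample markup and the
%rest hash are made up to mirror the snippet quoted above, and it
assumes an HTML::Element release with look_down() and right():

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample markup, for illustration only.
my $tree = HTML::TreeBuilder->new;
$tree->parse('<p><B>Cuisine:</B> Thai <BR><B>Price:</B> cheap <BR></p>');
$tree->eof;

my %rest;
for my $bold ($tree->look_down(_tag => 'b')) {
    # The B element must have exactly one child, a text node "Cuisine:".
    # Text nodes are plain strings, so they are not references.
    my @kids = $bold->content_list;
    next unless @kids == 1 && !ref $kids[0];
    next unless $kids[0] =~ /^\s*Cuisine:\s*$/;

    # In scalar context, right() returns the immediate right sister node.
    my $next = $bold->right;
    next unless defined $next && !ref $next;    # it should be text

    (my $text = $next) =~ s/^\s+|\s+$//g;       # trim surrounding whitespace
    $rest{cuisine} = $text;
    last;
}

print "cuisine: $rest{cuisine}\n" if defined $rest{cuisine};
```

This is robust against markup variations that would break the regex,
but it takes a dozen lines to say what the pattern says in one.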

> --  
> Sean M. Burke    [EMAIL PROTECTED]    http://www.spinn.net/~sburke/
 
-- 
Reinier Post                                     [EMAIL PROTECTED]