Re: HTML queries (was: HTML::Element screen-scraping)

Sean M. Burke Wed, 31 May 2000 11:23:03 -0700
Since I posted last night I've thought a bit about this business of
how to say "give me the nodes in a tree that meet these criteria" (at
least in list context; in scalar context, I do still rather like the
meaning "give me the FIRST node in a tree that meets these criteria").
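
A minimal sketch of that list-versus-scalar behaviour, using Perl's wantarray. Everything here is an assumption made for illustration: the name find_nodes is hypothetical, and the tree is built from plain hashes ({ _tag => ..., _content => [...] }, with text as plain strings) standing in for real HTML::Element objects, so that the example runs with core Perl only.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an HTML tree: elements are plain hashes,
# text nodes are plain strings.
my $tree = {
    _tag     => 'body',
    _content => [
        { _tag => 'p', _content => [ { _tag => 'b', _content => ['Cuisine:'] }, ' Thai ' ] },
        { _tag => 'p', _content => [ { _tag => 'b', _content => ['Price:'] },   ' cheap ' ] },
    ],
};

# Pre-order helper: append every element that satisfies $wanted to @$found.
sub _collect {
    my ($node, $wanted, $found) = @_;
    return unless ref $node;                 # skip text nodes
    push @$found, $node if $wanted->($node);
    _collect($_, $wanted, $found) for @{ $node->{_content} || [] };
    return;
}

# List context: all matching nodes.  Scalar context: the first match (or undef).
sub find_nodes {
    my ($root, $wanted) = @_;
    my @found;
    _collect($root, $wanted, \@found);
    return wantarray ? @found : $found[0];
}

my $is_b  = sub { $_[0]{_tag} eq 'b' };
my @all   = find_nodes($tree, $is_b);        # every 'b' element
my $first = find_nodes($tree, $is_b);        # only the first 'b' element
```

This version always walks the whole tree and only then picks the first match; a scalar-context call that stops at the first hit would be cheaper, at the cost of a slightly more involved walker.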

While I, in principle, do like R Post's idea of a query language, I've
been glancing idly at other tree-query languages for a while now (whether
for HTML, XML, or other tree formalisms, such as one has in most
linguistic syntax frameworks), and I've yet to have any of them make
me jump up and say "yes, THAT is THE way to do it!".  As opposed to
just /yet another/ way.

And as much of a fan as I (think I) am of little languages, embedding
a query language in HTML::Element would mean there's /two/ things to learn
-- how to say X in terms of method calls on HTML::Element objects, and
how to say X in terms of the query language.

In any case, I think I want to avoid what happened to traverse() --
that, in my hands, it became this unruly mess with too many calling
and returning syntaxes and caveats, and having users learn that-all
distracts them, I think, from the business of actually writing a
routine that involves traversing the tree.  I think what I should have
done is to have actually removed traverse() from the docs altogether,
and replaced it with notes on how to write your own traverser, and
esp. how to write an efficient pre-order traverser.  Not that I'd
remove the method itself -- but I wouldn't want it to be quite the
"attractive nuisance" it now is.  In any case, I may be able to achieve
the same effect by revising the wording of the existing documentation
for it.  And, of course, adding more examples.
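
As an illustration of the "write your own traverser" idea, here is a rough sketch of a recursive pre-order walker. The name preorder and the plain-hash node shape are hypothetical stand-ins chosen so the example is self-contained; with real HTML::Element objects one would presumably go through the content_list() and tag() methods instead of reaching into hashes.

```perl
use strict;
use warnings;

# Hypothetical stand-in tree: elements are plain hashes, text is plain strings.
my $tree = {
    _tag     => 'body',
    _content => [
        { _tag => 'p', _content => [ { _tag => 'b', _content => ['Cuisine:'] }, ' Thai ' ] },
        { _tag => 'p', _content => [ { _tag => 'b', _content => ['Price:'] },   ' cheap ' ] },
    ],
};

# Pre-order: visit the node itself first, then each child from left to right.
sub preorder {
    my ($node, $visit, $depth) = @_;
    $depth ||= 0;
    $visit->($node, $depth);
    return unless ref $node;                 # text nodes have no children
    preorder($_, $visit, $depth + 1) for @{ $node->{_content} || [] };
    return;
}

# Usage: record what was visited, in visiting order.
my @seen;
preorder($tree, sub {
    my ($node, $depth) = @_;
    push @seen, ref $node ? $node->{_tag} : 'text';
});
```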

Or, alternately, what I really want to do is to add a bit of a
cookbook to the Element docs, illustrating how to do some things.

But anyhoo, at least traverse() is nice and fast, and supports
terminating the traversal without having to die-and-catch or something
ugly like that.  And speaking of ugly, if anyone wants to see a
cautionary tale of why recursive algorithms are best implemented with
recursive code, have a look at the source for traverse.  Seized with a
fit of devil-may-care attitude, I wrote an iterative implementation of
the recursive algorithm -- to avoid the overhead of sub calls and
allocating a frame for each sub instance, etc.  It was nice to HAVE
done (sort of like passing a kidney stone), but nnnnever again!
Hopefully, tho, that'll be the Last Traverser You'll Ever Need.
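
For comparison, an iterative pre-order walk with an explicit stack, which can stop early without any die-and-catch, might look roughly like the following. This is not the actual traverse() source; the name walk_until, the "return true to stop" convention, and the plain-hash node shape are all assumptions made for the sake of a runnable sketch.

```perl
use strict;
use warnings;

# Hypothetical stand-in tree: elements are plain hashes, text is plain strings.
my $tree = {
    _tag     => 'body',
    _content => [
        { _tag => 'p', _content => [ { _tag => 'b', _content => ['Cuisine:'] }, ' Thai ' ] },
        { _tag => 'p', _content => [ { _tag => 'b', _content => ['Price:'] },   ' cheap ' ] },
    ],
};

# Iterative pre-order walk.  The callback returns true to end the walk,
# in which case the node it stopped on is returned.
sub walk_until {
    my ($root, $visit) = @_;
    my @stack = ($root);
    while (@stack) {
        my $node = pop @stack;
        return $node if $visit->($node);     # early exit, no exception needed
        # Push children in reverse so the leftmost child is popped first.
        push @stack, reverse @{ $node->{_content} || [] } if ref $node;
    }
    return;                                  # walked everything, nothing stopped us
}

# Usage: stop at the first 'b' element and count how many nodes were visited.
my $visited = 0;
my $hit = walk_until($tree, sub {
    $visited++;
    return ref $_[0] && $_[0]{_tag} eq 'b';
});
```

The explicit stack replaces the sub-call frames; the price is that the bookkeeping a recursive version gets for free has to be done by hand.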

> > As to your crypto-code like:
> > 
> > >   $p->content =~ /<B>Cuisine:</B> (.*) <BR>/;
> > >   $rest{cuisine} = $1;
> > 
> > ...this can be expressed in terms of tree structure as: find 'b'
> > elements with one text node (consisting of cuisine) as a child, and
> > then looking at its right sister node, which should be text...
> 
> Yes, but it would be nice to use a HTML-like notation,
> while employing the benefits of real HTML parsing.

In this case,
 $rest{'cuisine'} = $1 if $p->as_HTML() =~ m{<b>Cuisine:</b>\s*(.*?)\s*<br}i;

...which reminds me I've been meaning to have a look at as_HTML's
internals.
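
The tree-structure reading from the quoted text (find a 'b' element whose only child is the text "Cuisine:", then look at its right sister node, which should be text) can be sketched without going through a string of HTML at all. The name labelled_text and the plain-hash node shape are hypothetical, chosen so the example runs on core Perl; real HTML::Element objects would be walked through their methods rather than their hash keys.

```perl
use strict;
use warnings;

# Hypothetical stand-in tree: elements are plain hashes, text is plain strings.
my $tree = {
    _tag     => 'body',
    _content => [
        { _tag => 'p', _content => [
            { _tag => 'b', _content => ['Cuisine:'] }, ' Thai ',
            { _tag => 'br', _content => [] },
        ] },
    ],
};

# Find a 'b' element whose single child is the text $label, and return
# the text node immediately to its right, with surrounding whitespace trimmed.
sub labelled_text {
    my ($node, $label) = @_;
    return unless ref $node;
    my @kids = @{ $node->{_content} || [] };
    for my $i (0 .. $#kids) {
        my $kid = $kids[$i];
        next unless ref $kid;                          # only elements can be the label
        my @inner = @{ $kid->{_content} || [] };
        if ($kid->{_tag} eq 'b' and @inner == 1
            and !ref $inner[0] and $inner[0] eq $label) {
            my $next = $kids[$i + 1];                  # the right sister, if any
            if (defined $next and !ref $next) {
                (my $text = $next) =~ s/^\s+|\s+$//g;
                return $text;
            }
        }
        my $deeper = labelled_text($kid, $label);      # otherwise keep looking below
        return $deeper if defined $deeper;
    }
    return;
}

my %rest;
$rest{'cuisine'} = labelled_text($tree, 'Cuisine:');
```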

-- 
Sean M. Burke    [EMAIL PROTECTED]    http://www.spinn.net/~sburke/