> It's been a few years since I looked at XPath, but I seem to recall
> that it was originally inspired by SQL. If that's correct, that
> doesn't strike me as being very RegEx-like.

I think you're thinking of XQuery.  XPath is more RE-li[tk]e.

> But if I want to find the text inside B tags that happens to occur
> before a comment tag containing a specific string, it may not be
> possible to encode that in one query. But I haven't looked into it
> yet...

I'm not sure how XML::Twig's XPath-like patterns or real XPath feel
about comments. (They might not see them at all, since they're
non-semantic.)

> With screen scraping, you're just as likely to get semantic hints from
> the ordering of tags as you are from the ancestry.

True enough, although that's anathema to XML. For such bad HTML, you may
have to resort to Perl Cookbook Recipe 20.18 but perhaps with
HTML::TokeParser::Simple (instead of HTML::TokeParser).

Be sure to read the is_comment() section of
http://search.cpan.org/~ovid/HTML-TokeParser-Simple-3.15/lib/HTML/TokeParser/Simple.pm
(It uses HTML::Parser or HTML::PullParser under the hood; see the Caveat
section. You can use HTML::PullParser directly to define your own token
classes.)
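
To make the token-stream idea concrete, here is a rough sketch of the
"text inside B tags before a comment containing a specific string" case.
It is written in Python with the standard html.parser module rather than
HTML::TokeParser::Simple, purely as an illustration of the same approach.
The class name, the function name, and the "most recent B before the
comment" rule are my own assumptions, not anything taken from the Cookbook
recipe.

```python
from html.parser import HTMLParser


class BoldBeforeComment(HTMLParser):
    """Walk the document as a flat stream of events (hypothetical helper).

    Remembers the text of the most recently closed <b> element and records
    it whenever a comment containing `marker` is seen.
    """

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.in_bold = False
        self.current = []      # text pieces of the <b> being read
        self.last_bold = None  # text of the most recently closed <b>
        self.hits = []         # bold texts that preceded a matching comment

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <B> and <b> both arrive as "b".
        if tag == "b":
            self.in_bold = True
            self.current = []

    def handle_endtag(self, tag):
        if tag == "b" and self.in_bold:
            self.in_bold = False
            self.last_bold = "".join(self.current).strip()

    def handle_data(self, data):
        if self.in_bold:
            self.current.append(data)

    def handle_comment(self, data):
        # `data` is the comment body without the <!-- and --> delimiters.
        if self.marker in data and self.last_bold is not None:
            self.hits.append(self.last_bold)


def bold_before_comment(html, marker):
    """Return the bold text seen just before each comment containing marker."""
    parser = BoldBeforeComment(marker)
    parser.feed(html)
    parser.close()
    return parser.hits
```

The point is only that ordering falls out naturally when you read tokens in
sequence; ancestry never enters into it. The same loop in Perl would test
each token with is_start_tag, is_end_tag, is_text and is_comment.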

Alternatively, Regexp::Common is frequently useful for parsing "hard"
things, but so far it only has $RE{comment}{html}. Alas, the promised
Regexp::Common qw/html_tags/; has not been done yet.

> And as with traditional regular expression techniques for extracting
> data, you often want to find the last X that occurs before a Y, or
> other ordering relationships, ignoring other aspects of structure in
> the document.

That used to be true. In the modern <DIV>/<SPAN> CSS world, you're
starting to get semantic markup in the CSS class/id attributes of the
DIV, SPAN, or other tags, which XML::Twig can match.
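
As a small illustration of matching on class attributes, here is a sketch
in Python using the limited XPath subset in the standard
xml.etree.ElementTree module (not XML::Twig itself). It assumes the input
is well-formed XHTML, and the function name and sample markup are made up
for the example.

```python
import xml.etree.ElementTree as ET


def text_by_class(xhtml, tag, css_class):
    """Return the text of every `tag` element whose class attribute
    equals css_class exactly (hypothetical helper)."""
    root = ET.fromstring(xhtml)
    # ElementTree supports the [@attrib='value'] predicate; it compares
    # the whole attribute value, so class="price big" would NOT match.
    path = ".//{}[@class='{}']".format(tag, css_class)
    return ["".join(el.itertext()).strip() for el in root.findall(path)]


sample = (
    "<div id='item'>"
    "<span class='name'>Widget</span>"
    "<span class='price'>$9.99</span>"
    "<div class='note'><span class='price'>$1.00</span></div>"
    "</div>"
)
print(text_by_class(sample, "span", "price"))
```

Real-world tag soup will not parse this way, which is where the
token-based parsers come back in.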

> language, that contains directives for not only indicating
> parent-child relationships, but can also operate on the document as
> simply a stream of tags,

That may be beyond XPath. By design. 

-- Bill
Aka N1VUX etc
 
_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
