> It's been a few years since I looked at Xpath, but I seem to recall that
> it was originally inspired by SQL. If that's correct, that doesn't
> strike me as being very RegEx-like.

I think you're thinking of XQuery. XPath is more RE-li[tk]e.

> But if I want to find the text inside B tags that happens to occur
> before a comment tag containing a specific string, it may not be
> possible to encode that in one query. But I haven't looked into it yet...

I'm not sure how XML::Twig's XPath-like patterns or real XPath feel about
comments. (They might not see them at all, since they're non-semantic.)

> With screen scraping, you're just as likely to get semantic hints from
> the ordering of tags as you are from the ancestry.

True enough, although that's anathema to XML.

For such bad HTML, you may have to resort to Perl Cookbook Recipe 20.18,
but perhaps with HTML::TokeParser::Simple (instead of HTML::TokeParser).
Be sure to read
http://search.cpan.org/~ovid/HTML-TokeParser-Simple-3.15/lib/HTML/TokeParser/Simple.pm
under the is_comment() function. (It uses HTML::Parser or HTML::PullParser
under the hood; see the Caveat section. You can use HTML::PullParser
directly to define your own token classes.)

Alternatively, Regexp::Common is frequently useful for parsing "hard"
things, but so far it only has $RE{comment}{html}. Alas, the promised
  Regexp::Common qw/html_tags/;
has not been done yet.

> And as with
> traditional regular expression techniques for extracting data, you often
> want to find the last X that occurs before a Y, or other ordering
> relationships, ignoring other aspects of structure in the document.

That used to be true. In the modern <DIV><SPAN>CSS world, you're starting
to get semantic markup in the CSS class/id attributes of the DIV, SPAN, or
other tags ... which XML::Twig can match.

> language, that contains directives for not only indicating parent-child
> relationships, but also can also operate on the document as simply a
> stream of tags,

That may be beyond XPath. By design.

--
Bill Aka N1VUX etc
_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
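
P.S. For what it's worth, here is a rough sketch of the token-stream
approach mentioned above: walk the tokens with HTML::TokeParser::Simple,
remember the text of the most recent B element, and stop at the first
comment containing a given string. The sub name last_b_before_comment, the
sample HTML, and the marker string are made up for illustration; I have not
tried this against real-world bad HTML, so treat it as a starting point
only.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: return the text of the last <b>...</b> seen
# before the first comment containing $marker, or undef if there is none.
sub last_b_before_comment {
    my ( $html, $marker ) = @_;

    my $parser = HTML::TokeParser::Simple->new( string => $html );
    my ( $in_b, $current, $last_b ) = ( 0, '', undef );

    while ( my $token = $parser->get_token ) {
        if ( $token->is_start_tag('b') ) {
            # Entering a B element: start collecting its text.
            ( $in_b, $current ) = ( 1, '' );
        }
        elsif ( $token->is_end_tag('b') ) {
            # Leaving a B element: remember what we collected.
            ( $in_b, $last_b ) = ( 0, $current );
        }
        elsif ( $token->is_text ) {
            $current .= $token->as_is if $in_b;
        }
        elsif ( $token->is_comment
            && index( $token->as_is, $marker ) >= 0 )
        {
            # Found the marker comment: the last B seen so far is the answer.
            return $last_b;
        }
    }
    return;    # marker comment never appeared
}

# Made-up sample input, purely for illustration.
my $html = '<p><b>first</b> <b>second</b> '
         . '<!-- price follows --> <b>third</b></p>';

my $found = last_b_before_comment( $html, 'price follows' );
print defined $found ? "$found\n" : "no match\n";
```

With the sample input this should pick "second" and ignore "third", since
the loop returns as soon as it sees the marker comment. The point is that
the order of tokens drives the match, and ancestry is never consulted.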

