William Ricker wrote:
>> ...I seem to recall that it was originally inspired by SQL.
>
> I think you're thinking of XQuery.

Indeed.

> Be sure to read
> http://search.cpan.org/~ovid/HTML-TokeParser-Simple-3.15/lib/HTML/TokeParser/Simple.pm
> under the is_comment() function.

You mean this?

  is_comment()
    Are you still reading this? Nobody reads POD. Don't you know you're
    supposed to go to CLPM, ask a question that's answered in the POD
    and get flamed? It's a rite of passage. Really.

:-)

HTML::TokeParser::Simple does clean up the ugliness of HTML::TokeParser
(which I've declined to use in the past because of its primitive-feeling
API that requires string matching) with a more OO-style API, and it even
provides a look-ahead capability with the peek() method. But it isn't
fundamentally different from the solutions I listed in my original
email. I'd consider it if the other suggested alternatives fail to do
the job.

> Alternatively, Regexp::Common is frequently useful for parsing "hard"
> things, but it only has $RE{comment}{html} so far, alas, the promised
> Regexp::Common qw/html_tags/; has not been done yet.

Undoubtedly because it is widely recognized that regular expressions
aren't the right tool for tokenizing raw tags. While I'm seeking
something that is regular expression-like, it would need to be a
language layered on top of an HTML parser, which would do the usual job
of normalizing the data and extracting the structural relationships.

But Regexp::Common does suggest a possible standard upon which an
RE-like language syntax could be built.

Another possibility would be to use an HTML parser to transform a
document into a format that could be acted upon by Perl's built-in RE
engine, and then use Regexp::Common to extend the RE syntax. Where this
approach runs into problems is the parent-child relationships, which a
flat string does not preserve.

> In the modern <DIV><SPAN>CSS world, you're starting to get semantic
> markup in the CSS class/id attributes of the DIV SPAN or other tags
> ...

Yup. With some documents there is a very good mapping of semantics to
CSS class names.
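As a rough sketch of what scraping by class name can look like, assuming
HTML::TokeParser::Simple is installed: the sample markup, the class
names, and the text_by_class() helper below are all invented for
illustration and are not part of any module.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample document; the class names are made up.
my $html = <<'HTML';
<div class="article">
  <span class="title">Scraping with Perl</span>
  <span class="byline">A. Author</span>
  <span class="title featured">Another Story</span>
</div>
HTML

# Return the trimmed text of every <$tag> element whose class attribute
# lists $class among its space-separated names.
sub text_by_class {
    my ($source, $tag, $class) = @_;
    my $parser = HTML::TokeParser::Simple->new(string => $source);
    my @found;
    while (my $token = $parser->get_token) {
        next unless $token->is_start_tag($tag);
        my $attr = $token->get_attr('class');
        next unless defined $attr;
        next unless grep { $_ eq $class } split ' ', $attr;
        # get_trimmed_text() is inherited from HTML::TokeParser and
        # collects text up to the named end tag.
        push @found, $parser->get_trimmed_text("/$tag");
    }
    return @found;
}

my @titles = text_by_class($html, 'span', 'title');
print "$_\n" for @titles;
```

This stays a flat token scan: it does not track nesting, so an element
nested inside another element of the same tag name would be cut short,
which is the same parent-child limitation noted above.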
But of course it is hit-or-miss as to whether the document you want to
scrape happens to have been constructed that way.

 -Tom

-- 
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/
_______________________________________________
Boston-pm mailing list
http://mail.pm.org/mailman/listinfo/boston-pm

