Re: [Boston.pm] HTML parsing

Tom Metro Wed, 21 Mar 2007 09:50:26 -0800

William Ricker wrote:
>> ...I seem to recall that it was originally inspired by SQL. 
> 
> I think you're thinking of XQuery.


Indeed.


> Be sure to read 
> http://search.cpan.org/~ovid/HTML-TokeParser-Simple-3.15/lib/HTML/TokeParser/Simple.pm
> under the is_comment() function.

You mean this?

   is_comment()

   Are you still reading this? Nobody reads POD. Don't you know you're
   supposed to go to CLPM, ask a question that's answered in the POD and
   get flamed? It's a rite of passage.

   Really.

:-)

HTML::TokeParser::Simple does clean up the ugliness of HTML::TokeParser 
(which I've passed over in the past because of its primitive-feeling 
API that requires string matching) with a more OO-style API, and even 
provides a look-ahead capability with the peek() method, but it isn't 
fundamentally different from the solutions I listed in my original 
email. I'd consider it if the other suggested alternatives fail to do 
the job.


> Alternatively, Regexp::Common is frequently useful for parsing "hard"
> things, but it only has $RE{comment}{html} so far, alas, the promised
> Regexp::Common qw/html_tags/; has not been done yet. 

Undoubtedly because it is widely recognized that regular expressions 
aren't the right tool for tokenizing raw tags.

While I'm seeking something that is regular expression-like, it would 
need to be a language layered on top of an HTML parser, which would do 
the usual job of normalizing the data and extracting the structural 
relationships.

But Regexp::Common does suggest a possible standard upon which an 
RE-like language syntax could be built. Another possibility would be to 
use an HTML parser to transform a document into a format that could be 
acted upon by Perl's built-in RE engine, and then use Regexp::Common to 
extend the RE syntax. Where this approach runs into problems is the 
parent-child relationships.
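
To make the flattening idea concrete, here is a minimal sketch. It is written in Python purely for illustration; the same shape would apply to HTML::Parser feeding Perl's RE engine. The token format ("<tag>", "</tag>", and "T" for a text run) and the names Flattener and flatten are assumptions made up for this example, not an existing convention.

```python
import re
from html.parser import HTMLParser


class Flattener(HTMLParser):
    """Reduce a document to a flat list of structural tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Collapse every non-blank text run to a single placeholder token.
        if data.strip():
            self.tokens.append("T")


def flatten(html):
    """Return the document as one space-separated token string (hypothetical format)."""
    parser = Flattener()
    parser.feed(html)
    parser.close()
    return " ".join(parser.tokens)


flat = flatten("<div><div><p>inner</p></div><p>outer</p></div>")

# Patterns over adjacent tokens work well on the flat form.
paragraphs = re.findall(r"<p> T </p>", flat)

# Nesting does not: the non-greedy match pairs the outer <div> with the
# first </div>, which actually closes the inner <div>.
naive = re.search(r"<div>.*?</div>", flat).group(0)
```

Here `naive` ends at the inner closing tag, so the second paragraph, a child of the outer div, is silently dropped. A plain regular expression has no way to count nesting depth, which is the parent-child problem in miniature.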


> In the modern <DIV><SPAN>CSS world, you're starting to get semantic
> markup in the CSS class/id attributes of the DIV SPAN or other tags
> ...

Yup. With some documents there is a very good mapping of semantics to 
CSS class names. But of course it is hit-or-miss as to whether the 
document you want to scrape happens to have been constructed that way.
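
As a rough illustration of the hit-or-miss point, the sketch below pulls text out of elements by class name. It is in Python for illustration only; the class names ("price", "name"), the sample page, and the helper text_by_class are all hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect depth.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class ClassText(HTMLParser):
    """Collect the text inside every element carrying a given class."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.depth = 0    # > 0 while inside a matching element
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.wanted in classes:
            self.depth = 1
            self.found.append("")

    def handle_endtag(self, tag):
        if tag not in VOID and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.found[-1] += data


def text_by_class(html, wanted):
    parser = ClassText(wanted)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.found]


page = """
<div class="item"><span class="price">$5.00</span> <span class="name">Widget</span></div>
<div class="item"><span class="price sale">$3.50</span> <span class="name">Gadget</span></div>
<table><tr><td>$9.99</td></tr></table>
"""

prices = text_by_class(page, "price")
names = text_by_class(page, "name")
```

The two div-based rows are found by class name alone, while the price in the bare table cell is invisible to this approach and would need a structural rule instead.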

  -Tom

-- 
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/
 