TRANS wrote:
> On 7/18/06, Yann Klis <[EMAIL PROTECTED]> wrote:
>> If you'd like to do HTML parsing, you'd better use RubyfulSoup
>> http://www.crummy.com/software/RubyfulSoup/ which is a port of
>> Python's BeautifulSoup to Ruby, and IMHO better adapted to HTML
>> parsing than libxml.
>
> Isn't that REXML based though? Be that as it may, I think his point
> was that libxml can handle HTML if asked to do so. So why not make
> that an option? Is that right?

Yes. The functionality is already in libxml; it just has to be exposed in
the Ruby bindings.

Why use XPath for HTML?

Using XPath for parsing HTML is gaining popularity in many languages, and
there's good reason for it. XPath is like regular expressions for tree
structures: it's extremely powerful, and once learned it can be used for
many other tasks (XSLT, XQuery, etc.).

Another advantage is that XPath expressions are just strings, and virtually
anything you want to extract can be expressed in a single XPath string.
This makes them externalizable: put your XPath expressions in a config
file, and if the page you're scraping changes, simply update the file; no
need to change any code.
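
To make the config-file idea concrete, here is a rough sketch using REXML
and YAML from a stock Ruby install, since the libxml bindings don't expose
HTML parsing yet. Everything in it is made up for illustration: the field
names (title, nav_links), the XPath strings, the sample page, and the
extract helper are all hypothetical, and the page is deliberately
well-formed XHTML because REXML won't accept sloppy HTML. In real use the
YAML would live in its own file rather than inline.

```ruby
require 'rexml/document'
require 'yaml'

# Hypothetical config mapping field names to XPath strings.
# Inlined here so the sketch is self-contained; normally this
# would be loaded from a separate file.
XPATH_CONFIG = YAML.safe_load(<<~YAML)
  title: "//h1[@class='headline']"
  nav_links: "//div[@id='nav']//a"
YAML

# Made-up, well-formed sample page (assumption: real pages would
# need an HTML-tolerant parser, which is the point of this thread).
PAGE = <<~XHTML
  <html>
    <body>
      <h1 class="headline">Hello</h1>
      <div id="nav">
        <a href="/a">A</a>
        <a href="/b">B</a>
      </div>
    </body>
  </html>
XHTML

# Apply every configured XPath to the document and collect the
# text of each matching element under the field's name.
def extract(xml, xpaths)
  doc = REXML::Document.new(xml)
  xpaths.each_with_object({}) do |(name, xpath), out|
    out[name] = REXML::XPath.match(doc, xpath).map(&:text)
  end
end

RESULT = extract(PAGE, XPATH_CONFIG)
```

If the site renames the headline class or moves the navigation, only the
two strings in the config need to change; extract stays untouched.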

And finally, there are plenty of nifty helper tools. Check out the
'XPather' and 'XPath Checker' extensions for Firefox. Select any part of a
page, and Firefox will give you the XPath for it. Manipulate the XPath to
fine-tune it and see the results live. It's really cool.

So even though I *could* use a different HTML parser (like RubyfulSoup), I
can't bring myself to do it. It wouldn't be as powerful, as standardized,
or as familiar to me as XPath. I really feel that the future of tag-based
parsing is XPath, and I'd hate to see Ruby fall behind in this respect.

I realize that Tidy + REXML is a potential workaround, but it's a bit of a
hack and Tidy barfs on some bad HTML.

- Mark.

(Aitor: I'll try your stuff and let you know how it goes. Thanks!)



_______________________________________________
libxml-devel mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/libxml-devel
