[Imdbpy-devel] Status on DOM-based parsing

H. Turgut Uyar Wed, 11 Jun 2008 06:18:06 -0700

Hi,

As Davide has pointed out, I've been working on using DOM-style parsing
for the IMDb pages. This will hopefully make the code easier to maintain
and easier to adapt to future changes in the IMDb HTML design. The idea
is to represent each IMDb page as an XML tree and extract the
information using XPath.
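
As a rough illustration of the idea (not the actual IMDbPY code), the
sketch below parses a small, well-formed stand-in page with the standard
library's ElementTree and pulls fields out with path expressions. The
markup, the ids and the function name extract_movie are all made up for
the example; real IMDb pages look different and are usually not
well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a movie page (not real IMDb markup).
PAGE = """
<html>
  <body>
    <div id="title"><h1>Example Movie</h1><span class="year">1999</span></div>
    <div id="cast">
      <a href="/name/nm0000001/">First Actor</a>
      <a href="/name/nm0000002/">Second Actor</a>
    </div>
  </body>
</html>
"""


def extract_movie(page):
    """Build a tree from the page and read each field with one path expression."""
    root = ET.fromstring(page.strip())
    title = root.findtext(".//div[@id='title']/h1")
    year = root.findtext(".//div[@id='title']/span[@class='year']")
    cast = [(a.text, a.get("href")) for a in root.findall(".//div[@id='cast']/a")]
    return {"title": title, "year": year, "cast": cast}


movie = extract_movie(PAGE)
```

If the page layout changes, only the path strings need to be updated,
which is the maintenance benefit being aimed at.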


We've considered three XML processors (a nice summary of these can be
found at
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/):

BeautifulSoup is a pure Python parser which is lightweight and fast. It
also handles HTML pages gracefully even if they are not well-formed XML.
But it does not support XPath as of now.

ElementTree is similar, but in order to handle HTML files it needs
ElementTidy, which is C-based. ElementTree has only basic support for
XPath, which would probably not be enough, and its elements do not have
links to their parents, so XPath expressions that need to walk up to a
parent would be hard to process.
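
To make the missing parent links concrete, here is a minimal sketch of
the usual workaround with the standard library's ElementTree: build a
child-to-parent map once, then use it where an XPath expression would
step up to the parent. The table markup and the helper name value_for
are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical label/value table, standing in for a page fragment.
DOC = ET.fromstring(
    "<table>"
    "<tr><td>Director</td><td>Some Name</td></tr>"
    "<tr><td>Writer</td><td>Other Name</td></tr>"
    "</table>"
)

# ElementTree elements do not know their parent, so record it ourselves.
parent_of = {child: parent for parent in DOC.iter() for child in parent}


def value_for(label):
    """Return the text of the cell that follows the cell containing `label`."""
    for cell in DOC.iter("td"):
        if cell.text == label:
            siblings = list(parent_of[cell])  # step up to the row, list its cells
            position = siblings.index(cell)
            if position + 1 < len(siblings):
                return siblings[position + 1].text
    return None
```

With full XPath support the same lookup is a single expression such as
//td[text()='Director']/following-sibling::td[1], which is why the basic
subset is probably not enough here.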

lxml is very fast and has very good XPath support. But it has external
dependencies, mostly written in C (it builds on libxml2 and libxslt).

What I did was write an XPath parser for BeautifulSoup. At first it was
not very structured; now I'm re-implementing it according to the XPath
specification. I'm also keeping it in sync with lxml, so if lxml is
installed it will be used, and if not we will fall back to BeautifulSoup.
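
The fallback could look roughly like the sketch below. It only shows the
selection logic; the function name pick_backend, the preference tuple
and the is_available hook are assumptions made for the example, and
"bs4" is the module name of current BeautifulSoup releases (older
releases were imported as BeautifulSoup).

```python
import importlib.util

# Preferred backend first; module names are assumptions for this sketch.
PREFERENCE = ("lxml", "bs4")


def pick_backend(preference=PREFERENCE, is_available=None):
    """Return the first importable module name from `preference`, or None."""
    if is_available is None:
        # find_spec returns None for a top-level module that is not installed.
        def is_available(name):
            return importlib.util.find_spec(name) is not None
    for name in preference:
        if is_available(name):
            return name
    return None


backend = pick_backend()
```

Keeping the choice in one place means the rest of the parsing code only
has to ask which backend was picked, as long as both backends accept the
same XPath expressions.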

I'm hoping to have a better XPath parser for BeautifulSoup by tomorrow.
I can send it to anyone interested. Any help, suggestions and
recommendations are certainly welcome.

- Turgut


