Hi,

As Davide has pointed out, I've been working on using DOM-style parsing for the IMDb pages. This will hopefully make the code easier to maintain and easier to adapt to future changes in the IMDb HTML design. The idea is to represent the IMDb page as an XML tree and extract the information using XPath.
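To make the idea concrete, here is a minimal sketch of tree-plus-XPath extraction. It uses the standard library's xml.etree.ElementTree only so that it runs without extra packages. The page fragment, its ids and classes, and the extract() helper are all made up for illustration; they are not the real IMDb markup or the actual IMDbPY code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified and well-formed page fragment.
# The real IMDb markup is different and is usually not valid XML.
PAGE = """
<html>
  <body>
    <div id="title">
      <h1>Example Movie <span class="year">2008</span></h1>
    </div>
    <div class="info">
      <h5>Director:</h5>
      <a href="/name/nm0000000/">Jane Doe</a>
    </div>
  </body>
</html>
"""


def extract(page):
    """Build a tree from the page and pull out fields with XPath-style paths."""
    root = ET.fromstring(page.strip())
    # Each field is described by a path instead of hand-written string matching.
    title = root.find(".//div[@id='title']/h1")
    year = root.find(".//span[@class='year']")
    directors = [a.text for a in root.findall(".//div[@class='info']/a")]
    return {
        # h1.text is only the text before the nested <span>.
        "title": title.text.strip(),
        "year": year.text,
        "directors": directors,
    }


if __name__ == "__main__":
    print(extract(PAGE))
```

If the page layout changes, only the path strings should need updating, which is the maintainability gain I'm hoping for.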
We've considered three XML processors (a nice summary of these can be found at http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/):

BeautifulSoup is a pure Python parser which is light-weight and fast. It also handles HTML pages gracefully even if they are not proper XML. But it does not support XPath as of now.

elementtree is similar, but in order to handle HTML files it needs elementtidy, which is C-based. elementtree has only basic support for XPath, which would probably not be enough. Also, its elements do not have links to their parents, so XPath expressions that need to move up the tree would be hard to process.

lxml is very fast and has very good XPath support. But it has external dependencies, mostly C libraries.

What I did was write an XPath parser for BeautifulSoup. At first it was not very structured; now I'm re-implementing it according to the XPath specification. I'm also keeping it in sync with lxml, so if lxml is installed it will be used, and if not we will fall back to BeautifulSoup.

I'm hoping to have a better XPath parser for BeautifulSoup by tomorrow. I can send it to anyone interested. Any help, suggestions and recommendations are surely welcome.

- Turgut

_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel