On Jun 11, "H. Turgut Uyar" <[EMAIL PROTECTED]> wrote:

> I've been working on using DOM-style parsing for the IMDb pages.

As I said, it looks impressive; it remains to be seen whether XPath expressions are powerful enough for our needs, but the first tests you produced are truly amazing.

What we need is a way to replicate the features provided by the imdb.parser.http.utils.ParserBase class:
- get an HTML unicode string as input
- store, if needed, titlesRefs/namesRefs/charactersRefs
- parse the HTML according to some rules: currently this is done by the subclasses of ParserBase, but as you've said it can also be accomplished with something like a chain of expressions (maybe a dictionary mapping the keys we want to assign to one or more XPath expressions, which specify where/how to get the actual info from the HTML).

This should work, at least, for some of the simpler parsers; in many other cases we'll need more code, for example when we need to instantiate objects of the Movie/Person/Character/Company classes.
There are also a lot of other things to consider, like the fact that we always want unicode strings for our data.

In general, I assume it will be possible to have the old and the new parsers side by side, until every old parser is rewritten with the new tools.

> The idea is to represent the IMDb page as an xml tree and extract
> the information using xpath.

My only "problem" is that I used XPath & friends in another life, and I really need to study it again from scratch. :-)
Not a big deal: after all, not being paid, we're not in such a hurry. :-)

> I'm hoping to have a better xpath beautifulsoup parser by tomorrow. I
> can send it to anyone interested.

I'm sure nobody on the list will be offended by an attachment of a few KB. :-)
Oh, it goes without saying: if you need write access to some portion of the CVS tree, just ask.

Talking about other areas of development:
- I'd like to test whether it's possible (another hint by Turgut) to replace some of the '::' separated strings with a subclass of unicode.
  Something like (metacode):

    class InfoWithNote(unicode):
        info = u'the main info of this field'
        note = u'the optional note'
        def __unicode__(self):
            # method to print it in the u'info::note' format

  My only fear is about the movies' "plot" keyword: it's in the 'author of the summary::plot' format, which is the opposite of every other piece of information we gather. Bad choice of mine. :-/
- due to popular demand I'll investigate the feasibility of a switch from SQLObject to SQLAlchemy. For what we use it for, it won't make a big difference, but...
- I may shock the world by changing one of the long-standing bad choices of IMDbPY: I'd like to convert movie['year'] from a string to an int. :-) Maybe. I still have to see how much code will break.

-- 
Davide Alberani <[EMAIL PROTECTED]> [PGP KeyID: 0x465BFD47]
http://erlug.linux.it/~da/

_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
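
To make the "dictionary mapping keys to one or more XPath expressions" idea from the message a bit more concrete, here is a minimal sketch. It is only an illustration under stated assumptions, not IMDbPY code: it is written for Python 3, it uses the standard library's xml.etree.ElementTree (which supports only a subset of XPath and requires well-formed markup, so real IMDb pages would need a more lenient parser), and the names PAGE, RULES and extract(), as well as the sample markup, are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (real IMDb HTML is not this tidy).
PAGE = """<html><body>
<h1 class="title">Some Movie</h1>
<div id="genres"><a>Drama</a><a>Comedy</a></div>
</body></html>"""

# Each key we want to assign maps to one or more XPath expressions.
RULES = {
    'title': ".//h1[@class='title']",
    'genres': [".//div[@id='genres']/a"],
}


def extract(html, rules):
    """Return a dict mapping each key to the list of texts its rules match."""
    root = ET.fromstring(html)
    data = {}
    for key, paths in rules.items():
        if isinstance(paths, str):
            paths = [paths]  # accept a single expression as a shortcut
        values = []
        for path in paths:
            values.extend((el.text or '').strip() for el in root.findall(path))
        if values:  # keys with no match are simply left out
            data[key] = values
    return data


result = extract(PAGE, RULES)
```

This covers only the "simpler parsers" case described in the message; building Movie/Person/Character/Company objects would still need extra code on top of the extracted strings.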
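
The InfoWithNote metacode in the message could be fleshed out roughly as follows. This is a hedged sketch, not the design that IMDbPY adopted: it assumes Python 3, where str plays the role of unicode and __str__ the role of __unicode__, and the from_legacy() helper is a hypothetical addition for parsing the existing 'info::note' strings.

```python
class InfoWithNote(str):
    """A string whose value is the main info, with an optional note attached."""

    def __new__(cls, info, note=''):
        obj = super().__new__(cls, info)
        obj.info = info
        obj.note = note
        return obj

    def __str__(self):
        # Render in the legacy 'info::note' format when a note is present.
        if self.note:
            return '%s::%s' % (self.info, self.note)
        return self.info

    @classmethod
    def from_legacy(cls, text):
        """Hypothetical helper: split an existing 'info::note' string."""
        info, _, note = text.partition('::')
        return cls(info, note)


country = InfoWithNote.from_legacy('Italy::premiere')
```

With this layout the object compares and behaves like the plain info (country == 'Italy'), while str(country) still yields the old combined form. The "plot" field, stored as 'author of the summary::plot', would need its two parts swapped before using such a helper.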