Re: [Imdbpy-devel] Status on DOM-based parsing

Davide Alberani Thu, 12 Jun 2008 02:03:30 -0700

On Jun 11, "H. Turgut Uyar" <[EMAIL PROTECTED]> wrote:

> I've been working on using DOM-style parsing for the IMDb pages.


As I said, it looks impressive; it remains to be seen whether XPath
expressions are powerful enough for our needs, but the first tests you
produced are truly amazing.
What we need is a way to replicate the features provided by the
imdb.parser.http.utils.ParserBase class:
- get an HTML unicode string as input
- store, if needed, titlesRefs/namesRefs/charactersRefs
- parse the HTML according to some rules: currently this is done by
  the subclasses of ParserBase, but as you've said it can also be
  accomplished with something like a chain of expressions (maybe a
  dictionary mapping each key we want to assign to one or more
  XPath expressions, which specify where/how to get the actual info from
  the HTML).  This works, at least, for some of the simpler parsers: in
  many other cases we'll need more code, for example when we need to
  instantiate objects of the Movie/Person/Character/Company classes.
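
To make the "dictionary of XPath expressions" idea concrete, here is a
minimal sketch.  It is not IMDbPY code: SAMPLE, RULES and extract() are
hypothetical names, the markup is an invented stand-in for an IMDb page,
and it is written for Python 3 using only the standard library's
xml.etree.ElementTree, which supports a limited XPath subset and requires
well-formed input (real pages would likely need a more lenient parser).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for an IMDb page.
SAMPLE = """<html><body>
<h1 class="title">Example Movie</h1>
<div id="genres"><a>Drama</a><a>Comedy</a></div>
</body></html>"""

# Hypothetical rules: each key maps to one or more XPath expressions.
RULES = {
    'title': [".//h1[@class='title']"],
    'genres': [".//div[@id='genres']/a"],
}


def extract(html, rules):
    """Apply every expression of every rule and collect the text found."""
    root = ET.fromstring(html)
    data = {}
    for key, paths in rules.items():
        values = []
        for path in paths:
            for element in root.findall(path):
                text = ''.join(element.itertext()).strip()
                if text:
                    values.append(text)
        if values:
            data[key] = values
    return data


result = extract(SAMPLE, RULES)
```

A simple parser could then be little more than its RULES dictionary; the
cases that need Movie/Person/Character/Company instances would still
require extra code on top of this.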

There are also a lot of other things to consider, like the fact that
we always want unicode strings for our data.

In general, I assume it will be possible to have the old and the
new parser side by side, until every old parser is rewritten with
the new tools.

> The idea is to represent the IMDb page as an xml tree and extract
> the information using xpath.

My only "problem" is that I've used xpath & friends in another life,
and I really need to study it again from zero. :-)
Not a big deal: after all, not being paid, we aren't in such a
hurry. :-)

> I'm hoping to have a better xpath beautifulsoup parser by tomorrow. I
> can send it to anyone interested.

I'm sure nobody on the list will be offended by an attachment of a
few KB. :-)
Oh, it goes without saying: if you need write access to some
portion of the CVS tree, just ask.


Talking about other areas of development:
- I'd like to test if it's possible (another hint by Turgut) to
  replace some of the '::' separated strings with a subclass of unicode.
  Something like (metacode):
  class InfoWithNote(unicode):
      info = u'the main info of this field'
      note = u'the optional note'
      def __unicode__(self):
          # print it in the u'info::note' format
          return u'%s::%s' % (self.info, self.note)

  My only fear is about the movies' "plot" keyword: it's in the
  'author of the summary::plot' format, and it's the opposite of any
  other information we gather.  Bad choice of mine. :-/

- due to popular demand I'll investigate the feasibility of a switch
  from SQLObject to SQLAlchemy.
  For what we use it for, it won't make a big difference, but...

- I may shock the world by changing one of the long-standing bad
  choices of IMDbPY: I'd like to convert movie['year'] from a
  string to an int. :-)
  Maybe.  I still have to see how much code will be broken.
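
As a rough illustration of the InfoWithNote idea from the first point
above, here is a sketch written for Python 3, where str plays the role of
unicode.  It is only an assumption about how such a class could look: the
constructor signature and the from_legacy() helper (with its note_first
flag for the reversed "plot" format) are hypothetical and not part of
IMDbPY.

```python
class InfoWithNote(str):
    """A string holding the main info, with an optional note attached."""

    def __new__(cls, info, note=''):
        obj = super().__new__(cls, info)
        obj.info = info
        obj.note = note
        return obj

    def __str__(self):
        # Render in the legacy 'info::note' format when a note is present.
        if self.note:
            return f'{self.info}::{self.note}'
        return self.info

    @classmethod
    def from_legacy(cls, text, note_first=False):
        """Build an instance from a '::' separated string.

        note_first=True covers the 'author of the summary::plot' order.
        """
        first, sep, second = text.partition('::')
        if not sep:
            return cls(text)
        if note_first:
            return cls(second, first)
        return cls(first, second)


location = InfoWithNote.from_legacy('Rome, Italy::studio')
plot = InfoWithNote.from_legacy('Some Author::A plot summary.',
                                note_first=True)
```

The object still compares and behaves like the plain info string, so
existing code that only cares about the main value should keep working,
while str() gives back the old '::' format.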


-- 
Davide Alberani <[EMAIL PROTECTED]> [PGP KeyID: 0x465BFD47]
http://erlug.linux.it/~da/

_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel