Re: [Imdbpy-devel] First patch for DOM parser

H. Turgut Uyar Mon, 30 Jun 2008 04:37:11 -0700

Davide Alberani wrote:
> I'm committing my changes.  Basically I've moved your _paths structure
> to "extractors", a list/tuple of Extractor instances, which in turn
> contains a list of Attribute instances.
> The design is very close to yours and may be a bit more verbose, but
> in the long term can be more readable - I hope.


This is definitely more readable and flexible than my 
tuples/dictionaries soup.
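
For readers following along, here is a minimal sketch of the structure being described: a list of Extractor instances, each holding a list of Attribute instances. The constructor parameters (label, path, key, postprocess), the example paths, and the describe helper are assumptions made for illustration, not the classes as committed.

```python
class Attribute:
    """One value to pull out of a matched element (hypothetical sketch)."""

    def __init__(self, key, path, postprocess=None):
        self.key = key                  # name under which the value is stored
        self.path = path                # path relative to the matched element
        self.postprocess = postprocess  # optional callable applied to the raw value


class Extractor:
    """A group of attributes read from elements matching one path."""

    def __init__(self, label, path, attrs):
        self.label = label
        self.path = path
        self.attrs = list(attrs)


# Assumed example data: the label and paths below are invented.
extractors = [
    Extractor(
        label="filmography",
        path="//td/a",
        attrs=[
            Attribute(key="title", path="./text()",
                      postprocess=lambda s: s.strip()),
            Attribute(key="link", path="./@href"),
        ],
    ),
]


def describe(extractor_list):
    """Summarize which keys each extractor would produce."""
    return [(e.label, [a.key for a in e.attrs]) for e in extractor_list]
```

Compared with nested tuples and dictionaries, each field is named at the point of use, which is where the readability gain comes from.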

> I've slightly modified the parse_dom method, adding minor features (they
> are absolutely untested - some are still unused by the code!)
> 
> I hope you won't find it a complete mess. :-)
> 

I'm just confused about how to run it and how to activate the DOM 
parsers. The code refers to an imdbdom package as in:
   from imdbdom.utils import ...
And there is a line in http/__init__.py:
   from personParser import dom_person_main_parser
But these are not contained in the dom branch. Am I missing something or 
is it meant to be used together with the main branch?

> Basically, now, the parse method calls a set of other methods (including
> parse_dom), so that subclasses can modify the output where they need.
> 
> If something is not clear, ask (I wrote the code in a very small time).
> Every name/structure can still be changed: if you have other ideas
> and/or better names for classes and methods, it's time to do these
> changes.

I agree that "elem" is not a nice name :-) At one time I had thought 
of naming it "foreach" (to read it like "foreach td/a do the 
attribute paths are..."), but then it sounded a bit too mechanical.

> Many things are not handled, like name/title references (but the
> add_refs method is there).
> 
> I've removed the "result" parameter: it was too prone to side-effects;
> now parse_dom always returns a dictionary; later - other methods -
> can return whatever they want.
> 

That's surely a much better way. My intent in using the result 
parameter was to keep my changes to the other parts of the code to a 
minimum. Now that we have a new class, it should definitely be removed.
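
A rough sketch of the flow discussed here, assuming a base class whose parse method calls parse_dom (which always returns a fresh dictionary) and then a hook that subclasses may override. The class names, the postprocess_data hook name, and the stand-in body of parse_dom are assumptions for illustration; only parse and parse_dom are named in this thread.

```python
class ParserSketch:
    """Hypothetical base class: parse() chains overridable steps."""

    def parse(self, text):
        data = self.parse_dom(text)
        return self.postprocess_data(data)

    def parse_dom(self, text):
        # Stand-in for the real DOM traversal. The point is that a new
        # dictionary is built and returned on every call, instead of
        # mutating a "result" argument passed in by the caller.
        return {"raw": text}

    def postprocess_data(self, data):
        # Default hook: return the dictionary unchanged.
        return data


class TitleParserSketch(ParserSketch):
    """Hypothetical subclass that reshapes the output."""

    def postprocess_data(self, data):
        return {"title": data["raw"].strip()}
```

Because nothing is shared between calls, a subclass can return whatever it wants from the hook without affecting other parsers.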

You've reintroduced the analyze_imdbid function, which I had removed 
in favor of simple slicing. My reason for removing it was to 
prevent incorrect matches, such as matching a person id when we are trying 
to get a movie id. But that problem comes from the simple regex I had written, 
and it could be corrected by making analyze_imdbid smarter.
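
One possible way to make the id extraction stricter is to tell it which kind of id is expected. This is only a sketch: the function name extract_id, the kind argument, and the assumption that links look like /title/tt<digits>/ and /name/nm<digits>/ are illustrative, and this is not the existing analyze_imdbid.

```python
import re

# Assumed link shapes: /title/tt<digits>/ for movies, /name/nm<digits>/ for people.
_ID_PATTERNS = {
    "movie": re.compile(r"/title/tt(\d+)"),
    "person": re.compile(r"/name/nm(\d+)"),
}


def extract_id(href, kind):
    """Return the numeric id as a string, or None if href is not of that kind."""
    match = _ID_PATTERNS[kind].search(href)
    return match.group(1) if match else None
```

With this shape, asking for a movie id on a person link yields None instead of a wrong match.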

> In general, I'm amazed by the amount of code spared by this
> approach.  Just incredible. :-)
> Obviously there are still many things to do: error handling, for
> one (and checking that everything is unicode, and managing things
> like numeric values, and taking care of html/xml references, and so
> on...)
> 

I'm concerned about the performance and memory usage issues, especially 
with my beautifulsoup interpreter and the lambda functions for 
postprocessing. I'm hoping the network delays will make the parsing 
delays tolerable. Of course, there's always the lxml option if anyone 
needs more speed, or less memory consumption.


- Turgut


