[Imdbpy-devel] Simplifying the DOM parser

H. Turgut Uyar Fri, 18 Jul 2008 11:59:14 -0700

Hi,

I've committed a patch implementing the simplifications I suggested. Basically:


The key/section separation is gone. The information obtained for each 
attribute by applying its path is assigned to the specified key in the 
result. The key can be a simple string, an XPath expression, or None (in 
which case the extractor label serves as the key).

If an attribute is marked as multi, the value for the key will be a list 
with each result of the path as an element. If not, the result will be 
directly assigned.
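
To make the key and multi rules concrete, here is a rough sketch in plain 
Python. The helper name assign() and its arguments are made up for this 
illustration and are not the names used in the patch; it also ignores the 
case where the key is itself an xpath, and it assumes that a non-multi 
attribute yields exactly one result.

```python
def assign(result, label, key, path_results, multi=False):
    """Store the results of one attribute path in the result dict.

    Hypothetical helper, only meant to illustrate the rules above.
    """
    # A key of None means the extractor label serves as the key.
    name = label if key is None else key
    if multi:
        # multi: one list element per result of the path.
        result[name] = list(path_results)
    else:
        # Not multi: the result is assigned directly (assumption:
        # the path produced exactly one result).
        result[name] = path_results[0]
    return result


data = {}
assign(data, "cast", None, ["Keanu Reeves", "Carrie-Anne Moss"], multi=True)
assign(data, "info", "title", ["The Matrix"])
```

After these two calls, data holds a list under "cast" (the label, since the 
key was None) and a plain string under "title".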

All attribute paths should produce strings; that is, they should end 
in 'text()' or '@attribute'. The strings are automatically joined by 
the joiner. This is the one issue I'm not very comfortable with, but 
it makes it easier to understand what to expect as a result. It broke 
the existing movie ratings parser, which I had to tweak to make it work 
correctly.
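
A small sketch of what "strings joined by the joiner" means in practice. It 
uses the standard library's ElementTree rather than the parser code itself; 
itertext() stands in for a path ending in 'text()' and get() for one ending 
in '@attribute'. The markup, the function name and the empty-string joiner 
are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Made-up fragment, loosely shaped like a ratings block.
node = ET.fromstring(
    '<div><b>8.7</b>/10 <a href="ratings">1,000 votes</a></div>'
)


def joined_text(element, joiner=""):
    """Collect every text fragment under element and join them.

    Stand-in for applying a path that ends in text() and then
    handing the resulting list of strings to the joiner.
    """
    fragments = list(element.itertext())
    return joiner.join(fragments)


rating_line = joined_text(node)
# Stand-in for a path that ends in @href.
link = node.find("a").get("href")
```

The caller always receives a single string per result, which is what makes 
the outcome predictable, at the cost of having to split it again when the 
fragments were meaningful on their own.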

The 'single' attribute is removed, but we might need to reintroduce 
it. I have some reservations, though: my feeling is that it should be 
attached to a path expression, not to an attribute (an attribute can 
have multiple sub-paths). Also, the name 'single' can be confusing when 
used together with 'multi'; the two features are actually unrelated 
(or are they?).

BTW, my previous claim that position predicates can achieve the same 
effect as the 'single' attribute is incorrect, as I should have learned 
from the .//td[1] example from a few days back.
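
For the record, the reason is that a position predicate counts among the 
siblings of each parent, so .//td[1] matches the first td of every row, 
not the first td in the document. A quick check with the standard 
library's ElementTree (the table is made up for the example):

```python
import xml.etree.ElementTree as ET

table = ET.fromstring(
    "<table>"
    "<tr><td>a1</td><td>a2</td></tr>"
    "<tr><td>b1</td><td>b2</td></tr>"
    "</table>"
)

# The predicate is evaluated per parent: one match for each <tr>.
per_parent = [td.text for td in table.findall(".//td[1]")]

# Taking only the first match is what 'single' was meant to provide.
first_overall = table.find(".//td").text
```

Here per_parent holds two cells while first_overall is just the first one, 
so the predicate alone cannot replace 'single'.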

Extractor postprocessors are removed. Attribute postprocessors and 
joiners are not mutually exclusive.

 > I've also changed the DOMHTMLCharacterMaindetailsParser class:
 > the deepcopy wasn't working correctly with Python 2.4 (it's still
 > broken, but for other reasons: I think it's better to use
 > postprocess_data(), to do the required magic).
 >

You're right, I did it that way.

 > Right now I'm trying to use the old test-suite to compare the
 > old parsers with the new ones.
 >

I tried that a few days ago. It gave some errors, but I did not have the 
time to look into them.

And some notes:

You might have noticed that the BeautifulSoup Unicode character problem 
is solved.

The old parser for movie quotes collects some title references (when 
tried with movie 133093), but the new reference-gathering parser 
does not.

Turgut

