Hi,

I've committed a patch with the simplifications I suggested. Basically:
- The key/section separation is gone. The information obtained by applying each attribute's path is assigned to the specified key in the result. The key can be a simple string value, an xpath or None (in which case the extractor label will serve as the key).

- If an attribute is marked as multi, the value for the key will be a list with each result of the path as an element. If not, the result will be assigned directly.

- All attribute paths should produce strings. That means they should end in 'text()' or '@attribute'. The strings will automatically be joined by the joiner. This is the one issue that I'm not very comfortable with, but it makes it easier to understand what to expect as the result. It broke the existing movie ratings parser; I had to tweak it to make it work correctly.

- The 'single' attribute is removed, but we might need to bring it back. I have some problems with it, though: my feeling is that it should be attached to a path expression, not to an attribute (an attribute can have multiple sub-paths). Also, the name 'single' can be confusing when used together with the name 'multi'; the two features are actually unrelated (or are they not?). BTW, my previous claim that position predicates can achieve the same effect as the 'single' attribute is incorrect, as I should have learned from the .//td[1] example from a few days back.

- Extractor postprocessors are removed.

- Attribute postprocessors and joiners are not mutually exclusive.

> I've also changed the DOMHTMLCharacterMaindetailsParser class:
> the deepcopy wasn't working correctly with Python 2.4 (it's still
> broken, but for other reasons: I think it's better to use
> postprocess_data(), to do the required magic).

You're right, I did it that way.

> Right now I'm trying to use the old test-suite to compare the
> old parsers with the new ones.

I tried that a few days ago; it gave some errors, but I did not have the time to look into them.
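For illustration only, here is a minimal sketch of the key/multi/joiner rules listed above. It is not the committed code: the names Attribute and build_result are hypothetical, the xpath-as-key case is left out, and two points are assumptions rather than things stated in this mail: that the joiner runs before the attribute postprocessor, and that a non-multi attribute takes the first result.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Attribute:
    """Hypothetical stand-in for an attribute definition (not the imdbpy class)."""
    key: Optional[str] = None   # None means: fall back to the extractor label
    multi: bool = False         # True means: collect every result in a list
    joiner: str = ""            # glue for the strings produced by the path(s)
    postprocess: Optional[Callable[[str], object]] = None


def build_result(label: str, attr: Attribute,
                 results: List[List[str]]) -> Dict[str, object]:
    """Assemble the dict entry for one attribute.

    `results` holds one list of strings per path result; the strings are
    what paths ending in text() or @attribute would produce.
    """
    values = []
    for strings in results:
        value = attr.joiner.join(strings)       # strings are always joined
        if attr.postprocess is not None:        # assumption: joiner runs first
            value = attr.postprocess(value)
        values.append(value)

    key = attr.key if attr.key is not None else label
    if attr.multi:
        return {key: values}                    # one list element per result
    # Assumption: a non-multi attribute is assigned the first result directly.
    return {key: values[0] if values else None}


# Joiner and postprocessor used together on a non-multi attribute.
rating = Attribute(key="rating", postprocess=float)
print(build_result("ratings", rating, [["8", ".7"]]))

# key=None falls back to the extractor label; multi gives a list.
names = Attribute(multi=True, joiner=" ")
print(build_result("cast", names, [["first", "last"], ["other", "name"]]))
```

The second call shows the None-key fallback: the entry ends up under the extractor label "cast", with one joined string per result.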

And some notes:

- You might have noticed that the BeautifulSoup unicode characters problem is solved.

- The old parser for movie quotes collects some title references (when trying with the movie 133093), but the new reference gathering parser does not.

Turgut

_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel