Hi,
At the risk of complicating the DOM parser, I've added an extractor
grouping feature that should improve parser performance. The aim is
to eliminate repeated XPath lookups. Now, if the extractor has a 'group'
attribute (which should be an XPath expression), it is applied first
to get a set of group elements, each with a group key. The extractor
path is then applied relative to each group element to get the final
elements. If the attribute key is None, the group key is used as the
key. Here's an example using the old method:
Extractor(label='glossarysections',
          path="//[EMAIL PROTECTED]'glossary']/../../../..//tr",
          attrs=Attribute(key="../tr[1]/td/h5/a/@name",
                          multi=True, ...),
This method needlessly re-evaluates the attribute key XPath for every
matched element. Large pages in particular, such as the person keywords
page, become terribly slow because of this. The new method is:
Extractor(label='glossarysections',
          group="//[EMAIL PROTECTED]'glossary']",
          group_key="./@name",
          path="../../../..//tr",
          attrs=Attribute(key=None,
                          multi=True, ...),
The group_key is ".//text()" by default. I have reviewed the existing
parsers to take advantage of this, but the old method is also still valid.
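
To illustrate the idea outside the parser, here is a minimal sketch of
grouped extraction. It is not the actual parser code: the function name
extract_grouped and the sample markup are made up for illustration, and
it uses the standard library's xml.etree.ElementTree instead of the
parser's own XPath machinery. Because ElementTree's limited XPath cannot
return attribute values, the group key is passed here as an attribute
name rather than as an XPath like "./@name".

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; not taken from any real page.
SAMPLE = """<root>
  <section name="a">
    <row>apple</row>
    <row>avocado</row>
  </section>
  <section name="b">
    <row>banana</row>
  </section>
</root>"""


def extract_grouped(root, group, group_key_attr, path):
    """Collect element texts per group, computing each key only once.

    group          -- path selecting the group elements
    group_key_attr -- attribute of the group element used as the key
                      (assumption: stands in for a group_key XPath)
    path           -- path applied relative to each group element
    """
    result = {}
    for group_el in root.findall(group):
        # The key lookup happens once per group, not once per element.
        key = group_el.get(group_key_attr)
        for el in group_el.findall(path):
            result.setdefault(key, []).append(el.text)
    return result


root = ET.fromstring(SAMPLE)
sections = extract_grouped(root, ".//section", "name", "./row")
print(sections)
```

With the old approach the key would be looked up again for every row;
here it is looked up once per section and reused for all rows found
under that section.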
As a note, I think the markup of the person genres/keywords page has
changed and the old parser is not working correctly.
Turgut