[Imdbpy-devel] Extractor grouping in DOM parser

H. Turgut Uyar Tue, 12 Aug 2008 08:11:02 -0700

Hi,

At the risk of complicating the DOM parser, I've added an extractor 
grouping feature that should improve parser performance. The aim is 
to eliminate repeated xpath lookups. If an extractor has a 'group' 
attribute (which must be an xpath), that xpath is applied first 
to get a set of group elements, each with its own group key. The 
extractor path is then applied relative to each group element to get 
the final elements. If the attribute key is None, the group key is used 
as the key. Here's an example using the old method:


    Extractor(label='glossarysections',
              path="//[EMAIL PROTECTED]'glossary']/../../../..//tr",
              attrs=Attribute(key="../tr[1]/td/h5/a/@name",
                              multi=True, ...),

This method unnecessarily applies the attribute key xpath over and over, 
once for each matched element. Large pages in particular, such as the 
person keywords page, become terribly slow because of this. The new 
method is:

    Extractor(label='glossarysections',
              group="//[EMAIL PROTECTED]'glossary']",
              group_key="./@name",
              path="../../../..//tr",
              attrs=Attribute(key=None,
                              multi=True, ...),

The group_key defaults to ".//text()". I have reviewed the existing 
parsers to take advantage of this, but the old method is still valid.
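
To make the idea concrete, here is a rough sketch of the grouping logic. It 
is not the actual IMDbPY code: the function name extract_grouped and the 
sample markup are made up for illustration, and it uses the standard 
library's xml.etree.ElementTree, whose xpath support is limited. Because 
of that, the group key is read as a plain attribute name rather than an 
xpath like "./@name", and the path descends from the group element 
instead of climbing with "..".

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: each <section> is a group, each <row> a final element.
SAMPLE = """
<page>
  <section name="a">
    <row>alpha</row>
    <row>apple</row>
  </section>
  <section name="b">
    <row>beta</row>
  </section>
</page>
"""


def extract_grouped(root, group, group_key, path):
    """Look up the key once per group, then collect that group's elements."""
    result = {}
    for group_elem in root.findall(group):
        # One key lookup per group element, not one per matched row.
        key = group_elem.get(group_key)
        # The path is evaluated relative to the group element.
        for elem in group_elem.findall(path):
            result.setdefault(key, []).append(elem.text)
    return result


root = ET.fromstring(SAMPLE)
sections = extract_grouped(root, group=".//section",
                           group_key="name", path="./row")
print(sections)
```

With the old approach the key would be looked up again for every row; here 
it is looked up once per section and shared by all rows found under it.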

As a side note, I think the markup of the person genres/keywords page has 
changed, and the old parser is no longer working correctly.

Turgut

