Re: [Imdbpy-devel] First patch for DOM parser

Davide Alberani Tue, 08 Jul 2008 23:48:29 -0700

On Jul 07, "H. Turgut Uyar" <[EMAIL PROTECTED]> wrote:

> Yes. We need documentation and guidelines about how to write the
> parsers. We also need to sort out the conceptual issues about
> extractors, attributes, keys, postprocessors etc.


Trying to write a replacement for movieParser.HTMLRatingsParser,
I started with the code below (incomplete, and the data is not yet in
the required format). Besides the fact that bsoup raises
an IndexError exception, the problem is that the current parse_dom
method is unable to map the structure of the expected output,
which is something like:
{'arithmetic mean': 8.5,
 'rating': 8.5,
 'votes': 274730,
 'median': 9,
 'number of votes': {1: 7128, 2: 1958, 3: 2456, 4: 3007, 5: 5052,
                     6: 9531, 7: 21999, 8: 45371, 9: 69489, 10: 108739},
 'demographic': {u'aged 45+': (11724, 7.7),
                 u'imdb staff': (36, 8.8),
                 u'aged 30-44': (60029, 8.5),
                 u'females': (28663, 8.3),
                 u'females aged 30-44': (7489, 8.3),
                 'all votes': (274730, 8.5),
                 u'females aged 45+': (2020, 7.4),
                 u'males': (189097, 8.6),
                 u'males aged 18-29': (120639, 8.8),
                 u'males under 18': (5191, 8.9),
                 u'aged 18-29': (139003, 8.8),
                 u'males aged 30-44': (51737, 8.5),
                 u'non-us users': (135562, 8.6),
                 u'females aged 18-29': (17468, 8.4),
                 u'us users': (81423, 8.5),
                 u'females under 18': (1192, 7.5),
                 u'aged under 18': (6392, 8.8),
                 u'top 1000 voters': (786, 7.4),
                 u'males aged 45+': (9541, 7.7)},
 'top 250 rank': 32}

The code (only for 'number of votes' and 'mean and median'):

class DOMHTMLRatingsParser(DOMParserBase):
    extractors = [
        Extractor(label='number of votes',
            path="//td/b[text()='Percentage']/../../..",
            attrs=Attribute(key='ccc',
                            path={'votes': ".//td[1]//text()",
                                  'percentage': ".//td[2]//text()",
                                  'ordinal': ".//td[3]//text()"})),
        Extractor(label='mean and median',
            path="//p[starts-with(text(), 'Arithmetic mean')]",
            attrs=Attribute(key='mean and median',
                            path='text()',
                            single=True))
        ]

For 'number of votes', using a dictionary in Attribute.path, we
get a single joined string with all the values, while using a list of
Attribute objects, we get three lists of values.

Two possible solutions: we can leave it this way, let parse_dom return
lists and manage them later with code.  On the other hand, we can find a
more generic way to map the data fetched by XPath expressions (lists?
list of lists? dictionaries?) to our expected output.
There will be some cases where some code will still be required, but very
few.
The problem is that I don't have an idea about how we can write
something like "fetch data using these XPaths and build a dictionary
where keys are named this way and values are strings/tuples/lists
of the fetched values".
Obviously writing code to be later evaluated with 'exec'/'eval'
is out of the question. :-)
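
A minimal sketch of how such a mapping could be written as plain data
instead of code to be evaluated. This is not the IMDbPY API: SPEC,
apply_spec and the sample markup are hypothetical names made up for
illustration, and it uses the standard library's ElementTree (which
supports only a subset of XPath) rather than the real parser backends.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the ratings table.
SAMPLE = """<table>
  <tr><td><b>Votes</b></td><td><b>Percentage</b></td><td><b>Rating</b></td></tr>
  <tr><td>108739</td><td>39.6%</td><td>10</td></tr>
  <tr><td>69489</td><td>25.3%</td><td>9</td></tr>
</table>"""

# The mapping is plain data: where the rows are, which cell is the key,
# which cell is the value, and which callable converts each of them.
SPEC = {
    'number of votes': {
        'rows': './/tr',
        'key': ('td[3]', int),
        'value': ('td[1]', int),
    },
}


def apply_spec(root, spec):
    """Build one dictionary per label, following the declarative spec."""
    output = {}
    for label, rule in spec.items():
        key_path, key_type = rule['key']
        value_path, value_type = rule['value']
        mapping = {}
        for row in root.findall(rule['rows']):
            key = (row.findtext(key_path) or '').strip()
            value = (row.findtext(value_path) or '').strip()
            # Skip rows (such as the header) with no direct cell text.
            if not key or not value:
                continue
            mapping[key_type(key)] = value_type(value)
        output[label] = mapping
    return output


result = apply_spec(ET.fromstring(SAMPLE), SPEC)
print(result)
```

The converters are ordinary callables referenced from the spec, so
nothing has to pass through 'exec' or 'eval'.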

> Besides, the design style of these modules should match the other
> parts of imdbpy, so I'm suggesting that you set the guidelines.

That's not a big deal: the current code is already close enough to
satisfy this requirement. The 'html' parsers are used by other
data access systems in some special cases, but it will be easy
to use the new ones.

At this time I'm thinking about how we can rewrite parse_dom
as a series of at least two separate stages.
1st stage: tell the XPaths the data to be fetched, maybe with minor
features like "return None if empty", and store this data in an
intermediate generic format (a dictionary with lists of strings
as its values?)
2nd stage: write rules to transform the data from this intermediate
format to the one we need.

The problem is: how to express that?
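
One possible way to express the two stages, as a rough sketch only.
The names fetch, transform and zip_to_dict, and the sample markup, are
assumptions invented for this example and are not part of parse_dom;
ElementTree from the standard library stands in for the real backends.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the ratings table.
SAMPLE = """<table>
  <tr><td><b>Votes</b></td><td><b>Percentage</b></td><td><b>Rating</b></td></tr>
  <tr><td>108739</td><td>39.6%</td><td>10</td></tr>
  <tr><td>69489</td><td>25.3%</td><td>9</td></tr>
</table>"""


def fetch(root, paths):
    """1st stage: run each path and keep a list of non-empty strings.

    A name maps to None when its path yields nothing.
    """
    data = {}
    for name, path in paths.items():
        values = [(el.text or '').strip() for el in root.findall(path)]
        values = [v for v in values if v]
        data[name] = values or None
    return data


def zip_to_dict(key_field, value_field, key_type=int, value_type=int):
    """Return a rule pairing two fetched lists into a dictionary."""
    def rule(data):
        keys = data.get(key_field) or []
        values = data.get(value_field) or []
        return {key_type(k): value_type(v) for k, v in zip(keys, values)}
    return rule


def transform(data, rules):
    """2nd stage: apply each rule to the intermediate data."""
    return {key: rule(data) for key, rule in rules.items()}


PATHS = {
    'votes': './/tr/td[1]',
    'ordinal': './/tr/td[3]',
    'missing': './/tr/td[4]',   # no such column: stored as None
}
RULES = {'number of votes': zip_to_dict('ordinal', 'votes')}

intermediate = fetch(ET.fromstring(SAMPLE), PATHS)
final = transform(intermediate, RULES)
print(intermediate)
print(final)
```

Keeping the intermediate format as a dictionary of string lists means
the first stage knows nothing about the output shape, while the second
stage never touches the document.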

-- 
Davide Alberani <[EMAIL PROTECTED]> [PGP KeyID: 0x465BFD47]
http://erlug.linux.it/~da/
