On Jul 07, "H. Turgut Uyar" <[EMAIL PROTECTED]> wrote:

> Yes. We need documentation and guidelines about how to write the
> parsers. We also need to sort out the conceptual issues about
> extractors, attributes, keys, postprocessors etc.

Trying to write a replacement for movieParser.HTMLRatingsParser, I started with the code below (it is incomplete, and the data is not yet in the required format). Besides the fact that bsoup raises an IndexError exception, the problem is that the current parse_dom method is unable to map the structure of the expected output, which is something like:

    {'arithmetic mean': 8.5,
     'rating': 8.5,
     'votes': 274730,
     'median': 9,
     'number of votes': {1: 7128, 2: 1958, 3: 2456, 4: 3007, 5: 5052,
                         6: 9531, 7: 21999, 8: 45371, 9: 69489,
                         10: 108739},
     'demographic': {u'aged 45+': (11724, 7.7),
                     u'imdb staff': (36, 8.8),
                     u'aged 30-44': (60029, 8.5),
                     u'females': (28663, 8.3),
                     u'females aged 30-44': (7489, 8.3),
                     'all votes': (274730, 8.5),
                     u'females aged 45+': (2020, 7.4),
                     u'males': (189097, 8.6),
                     u'males aged 18-29': (120639, 8.8),
                     u'males under 18': (5191, 8.9),
                     u'aged 18-29': (139003, 8.8),
                     u'males aged 30-44': (51737, 8.5),
                     u'non-us users': (135562, 8.6),
                     u'females aged 18-29': (17468, 8.4),
                     u'us users': (81423, 8.5),
                     u'females under 18': (1192, 7.5),
                     u'aged under 18': (6392, 8.8),
                     u'top 1000 voters': (786, 7.4),
                     u'males aged 45+': (9541, 7.7)},
     'top 250 rank': 32}

The code (only for 'number of votes' and 'mean and median'):

    class DOMHTMLRatingsParser(DOMParserBase):
        extractors = [
            # The table whose header contains 'Percentage': one path
            # per column.
            Extractor(label='number of votes',
                      path="//td/b[text()='Percentage']/../../..",
                      attrs=Attribute(key='ccc',
                                      path={'votes': ".//td[1]//text()",
                                            'percentage': ".//td[2]//text()",
                                            'ordinal': ".//td[3]//text()"})),
            # The paragraph starting with 'Arithmetic mean'.
            Extractor(label='mean and median',
                      path="//p[starts-with(text(), 'Arithmetic mean')]",
                      attrs=Attribute(key='mean and median',
                                      path='text()',
                                      single=True))
        ]

For 'number of votes', using a dictionary as Attribute.path, we get a single joined string with all the values, while using a list of Attribute objects, we get three lists of values.

Two possible solutions: leave it this way, letting parse_dom return lists, and manage them later with code.
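
As a rough sketch of that first option (post-processing the lists with code), assume parse_dom hands back parallel lists of strings for the 'votes' and 'ordinal' paths. The helper name build_votes_histogram and the sample strings are hypothetical and not part of imdbpy; the real lists may contain different header cells or whitespace.

```python
def build_votes_histogram(votes, ordinals):
    """Pair each rating (ordinal) with its vote count.

    Both arguments are parallel lists of strings, as an XPath text()
    query might return them. Rows that are not purely numeric (for
    example a header row) are skipped.
    """
    histogram = {}
    for vote_text, ordinal_text in zip(votes, ordinals):
        # Assumption: counts may carry whitespace or thousands separators.
        vote_text = vote_text.strip().replace(',', '')
        ordinal_text = ordinal_text.strip()
        if not (vote_text.isdigit() and ordinal_text.isdigit()):
            continue
        histogram[int(ordinal_text)] = int(vote_text)
    return histogram


# Hypothetical raw lists, including a header row.
raw_votes = ['Votes', '108739', '69489', ' 45371 ']
raw_ordinals = ['Rating', '10', '9', '8']

number_of_votes = build_votes_histogram(raw_votes, raw_ordinals)
```

With these sample lists, number_of_votes maps the ratings 10, 9 and 8 to their integer vote counts, which is the shape of the 'number of votes' entry in the expected output. The drawback is that every parser needs its own hand-written helper of this kind.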

On the other hand, we could find a more generic way to map the data fetched by XPath expressions (lists? lists of lists? dictionaries?) to our expected output. There will be some cases where some code will still be required, but very few. The problem is that I don't have an idea about how we can write something like "fetch data using these XPaths and build a dictionary where keys are named this way and values are strings/tuples/lists of the fetched values". Obviously, writing code to be later evaluated with 'exec'/'eval' is out of the question. :-)

> Besides, the design style of these modules should match the other
> parts of imdbpy, so I'm suggesting that you set the guidelines.

That's not a big deal: the current code is already close enough to satisfy this requirement. The 'html' parsers are used by other data access systems in some special cases, but it will be easy to use the new ones.

At this time I'm thinking about how we can rewrite parse_dom as a series of at least two separate stages.

1st stage: specify the XPaths of the data to be fetched, maybe with minor features like "return None if empty", and store this data in an intermediate generic format (a dictionary with lists of strings as its values?).

2nd stage: write rules to transform the data from this intermediate format to the one we need.

The problem is: how to express that?

-- 
Davide Alberani <[EMAIL PROTECTED]> [PGP KeyID: 0x465BFD47]
http://erlug.linux.it/~da/