Re: [Imdbpy-devel] First patch for DOM parser

H. Turgut Uyar Thu, 03 Jul 2008 04:14:40 -0700

Davide Alberani wrote:
> On Jul 02, "H. Turgut Uyar" <[EMAIL PROTECTED]> wrote:
> 
> Seen - very useful for such a generic parser.  For other parsers used
> for multiple pages (like persons/characters), maybe we can write
> two separated parser: after all the DOM approach spares so many lines
> of code... :-)
>


Generally, separate parsers are better, as in the search movie, person, 
etc. parsers. But in this case (official sites, external reviews, ...) 
all of these parsers would have exactly the same extractors and attributes, 
which would result in repeated code.

> I've committed support for names/titles references (mostly untested).

Seen it. I still have to figure out how references are used. Can you 
tell me where to find an example?

> As you can see from the GatherRefs class I still have some problems
> with DOM/XPath: I'm almost sure there is a cleaner way to obtain the
> same result.
> 

I think the ones you've written are fine. One suggestion though: I think 
we should always write path expression strings in double quotes, because 
single quotes can be part of the expression itself.
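
To illustrate the quoting suggestion, here is a small sketch. The variable names are made up for the example; the path is the kind of expression used for the quotes page. It only shows that the double-quoted form avoids the escaping the single-quoted form needs:

```python
# XPath string literals inside the expression use single quotes, so wrapping
# the whole expression in double quotes needs no escaping.
double_quoted = "//b/a[starts-with(@href, '/name/nm')]/.."

# The same expression in a single-quoted Python string needs backslashes.
escaped_single = '//b/a[starts-with(@href, \'/name/nm\')]/..'

# Both spellings produce the same string; the first is easier to read.
print(double_quoted == escaped_single)
```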

> Speaking of that: I was thinking at a parser for the movie's quotes
> page, and I had some real trouble: the data is not in a <ul> list,
> but just separated by <hr> and I can't find an easy way to express - with
> XPath - the portion of document I need.  Can you write me an example,
> for a parser for: http://akas.imdb.com/title/tt0133093/quotes ?
> 

Yes, this one's tricky. Playing with XPather I see that the following 
path gives me all the 'b' elements that contain character names for 
quotes (could be the extractor):
   //b/a[starts-with(@href, '/name/nm')]/..

After that the attributes could be:
   character link: a/@href
   character name: a/text()
   quote:          following-sibling::text()
   section:        preceding-sibling::a[1]/@name

The section specification would group quotes using the name attribute of 
the preceding 'a' element. Two problems here:

- Still have to handle the notes in italic.

- My bsoup interpreter does not support preceding-sibling yet, but it 
should be easy to add.
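
A rough sketch of what the extractor and attributes above would pick out, written with the standard library's ElementTree instead of a full XPath engine (ElementTree has no starts-with() or sibling axes, so the logic is emulated in plain Python). The SAMPLE markup, the ids in it, and the extract_quotes helper are all invented for illustration; the real page is not well-formed XML and is not flat like this.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment: quotes separated by <hr/>, each group
# preceded by a named anchor. Names, ids and text are placeholders.
SAMPLE = """<div>
<a name="qt0001"/>
<b><a href="/name/nm0000001/">Speaker A</a></b>: first line.<br/>
<b><a href="/name/nm0000002/">Speaker B</a></b>: second line.<br/>
<hr/>
<a name="qt0002"/>
<b><a href="/name/nm0000001/">Speaker A</a></b>: third line.<br/>
</div>"""


def extract_quotes(root):
    """Emulate //b/a[starts-with(@href, '/name/nm')]/.. and its attributes."""
    results = []
    section = None
    for el in root:  # direct children, in document order
        if el.tag == "a" and el.get("name"):
            # Stands in for preceding-sibling::a[1]/@name: remember the most
            # recent named anchor seen before the current 'b' element.
            section = el.get("name")
        elif el.tag == "b":
            link = el.find("a")
            if link is not None and link.get("href", "").startswith("/name/nm"):
                results.append({
                    "link": link.get("href"),      # a/@href
                    "name": link.text,             # a/text()
                    # el.tail is only the text node right after </b>, i.e.
                    # closer to following-sibling::text()[1].
                    "quote": (el.tail or "").strip(" :\n"),
                    "section": section,
                })
    return results


quotes = extract_quotes(ET.fromstring(SAMPLE))
for q in quotes:
    print(q["section"], q["name"], q["quote"])
```

Grouping the resulting dictionaries by their "section" value would then give one entry per quote block. The notes in italic are not handled here either.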

Turgut

> Thanks!
> 


