Re: [Imdbpy-devel] First patch for DOM parser

H. Turgut Uyar Fri, 04 Jul 2008 01:24:49 -0700

Davide Alberani wrote:
> [the quotes parser]
> 
> Quite tricky. :-)
> 
>>    quote:          following-sibling::text()
> 
> Using following::text() seems to collect also the notes in italic,
> but I don't know if it's too difficult to support it for bsoup.
>


BeautifulSoup has some functionality that might be usable for this, but
I have to check.

I started to see XPath expressions in my dreams, and something occurred
to me last night :-) The path I suggested for the extractor is not good
enough:
    //b/a[starts-with(@href, '/name/nm')]/..

This assumes that every character has a page. In the "Matrix" quotes all
characters have pages, so it seemed to work, but on this page, for
example, there is a "Policeman" character who does not have a page:
   http://akas.imdb.com/title/tt0295701/quotes
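
To illustrate the problem, here is a minimal sketch. The markup is a
made-up, simplified stand-in for the quotes page (the real HTML may
differ), and it uses the standard library's ElementTree instead of our
parsers. ElementTree has no starts-with(), so the href test is done in
Python as a rough equivalent of the path above:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; not copied from the real page.
SNIPPET = """<div>
<b><a href="/name/nm0000206/">Neo</a></b>: Whoa.<br/>
<b>Policeman</b>: Freeze!<br/>
</div>"""

root = ET.fromstring(SNIPPET)

# Rough equivalent of //b/a[starts-with(@href, '/name/nm')]/..
# i.e. keep only the <b> elements that contain a matching link.
linked = [
    b for b in root.iter("b")
    if any(a.get("href", "").startswith("/name/nm") for a in b.findall("a"))
]

all_speakers = ["".join(b.itertext()) for b in root.iter("b")]
linked_speakers = ["".join(b.itertext()) for b in linked]

# Speakers the path never reaches because they have no link.
missed = [s for s in all_speakers if s not in linked_speakers]
print(missed)
```

The unlinked speaker is never selected, so its quote would be dropped.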

The solution that comes to my mind is to use your preprocess_string 
feature and insert 'div' tags into the HTML code:

Put a
   <div class="_imdbpy">
in front of every
   <a name="qtXXX">
and a
   </div>
after (or before) every
   <hr width="30%">

After that, the extractor path should be easier to write.
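
A rough sketch of the preprocessing step, written as a standalone
function with plain regular expressions rather than through
preprocess_string itself. The sample HTML is made up, wrap_quotes is a
hypothetical name, and I am assuming that the XXX in qtXXX is a run of
digits. Here the closing tag goes before the hr:

```python
import re

# Hypothetical, simplified quotes markup; the real page may differ.
HTML = (
    '<a name="qt0001"></a><b><a href="/name/nm0000206/">Neo</a></b>: Whoa.<br>'
    '<hr width="30%">'
    '<a name="qt0002"></a><b>Policeman</b>: Freeze!<br>'
    '<hr width="30%">'
)

# Assumption: quote anchors look like <a name="qt"> followed by digits.
_anchor_re = re.compile(r'(<a name="qt\d+">)')
_rule_re = re.compile(r'(<hr width="30%">)')


def wrap_quotes(html):
    """Wrap each quote in <div class="_imdbpy"> ... </div>."""
    # Open a div in front of every quote anchor.
    html = _anchor_re.sub(r'<div class="_imdbpy">\1', html)
    # Close the div just before every separator rule.
    html = _rule_re.sub(r'</div>\1', html)
    return html


wrapped = wrap_quotes(HTML)
print(wrapped)
```

With every quote in its own div, the extractor could start from
something like //div[@class='_imdbpy'], and the speaker would no longer
need to be a link.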

Do you want to write the parser yourself or would you like me to have a 
go at it?

Turgut


