Davide Alberani wrote:
> [the quotes parser]
>
> Quite tricky. :-)
>
>> quote: following-sibling::text()
>
> Using following::text() seems to collect also the notes in italic,
> but I don't know if it's too difficult to support it for bsoup.
>
BeautifulSoup has some functionality that might be usable for this, but I
have to check.

I've started seeing XPath expressions in my dreams, and something occurred
to me last night :-) The path I suggested for the extractor is not good
enough:

    //b/a[starts-with(@href, '/name/nm')]/..

This assumes that the character has a page. In the "Matrix" quotes all
characters have pages, so it seemed to work, but on this page, for example,
there is a "Policeman" character who does not have one:

    http://akas.imdb.com/title/tt0295701/quotes

The solution that comes to mind is to use your preprocess_string feature
to insert 'div' tags into the HTML code:

  - put a <div class="_imdbpy"> in front of every <a name="qtXXX">
  - put a </div> after (or before) every <hr width="30%">

After that, the extractor path should be easier to write.

Do you want to write the parser yourself, or would you like me to have a
go at it?

Turgut

_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
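
For what it's worth, the div-wrapping idea above could be sketched roughly like this. It is only an illustration built on assumptions: the function name wrap_quotes and the two regular expressions are made up for the example, "qtXXX" is assumed to mean "qt" followed by digits, and the real page markup may differ in case, quoting or spacing. How this would be hooked into preprocess_string is not shown.

```python
import re

# Hypothetical patterns: assumes anchors look like <a name="qt0001"> and
# separators look like <hr width="30%">, as described in the message.
_QUOTE_ANCHOR = re.compile(r'(<a\s+name="qt\d+"[^>]*>)', re.IGNORECASE)
_QUOTE_SEPARATOR = re.compile(r'(<hr\s+width="30%"\s*/?>)', re.IGNORECASE)


def wrap_quotes(html):
    """Wrap each quote block in <div class="_imdbpy"> ... </div>."""
    # Open a div in front of every quote anchor.
    html = _QUOTE_ANCHOR.sub(r'<div class="_imdbpy">\1', html)
    # Close the div after every separator rule.
    html = _QUOTE_SEPARATOR.sub(r'\1</div>', html)
    return html


# Invented sample: one character with a name page, one without.
sample = (
    '<a name="qt0001"></a><b><a href="/name/nm0000206/">Neo</a></b>: Whoa.<br>'
    '<hr width="30%">'
    '<a name="qt0002"></a><b>Policeman</b>: Freeze!<br>'
    '<hr width="30%">'
)
wrapped = wrap_quotes(sample)
print(wrapped)
```

With every quote wrapped this way, a path along the lines of //div[@class='_imdbpy'] should select one node per quote whether or not the character has a page, though that is untested against the live page.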