Davide Alberani wrote:
> [the quotes parser]
>
> Quite tricky. :-)
>
>> quote: following-sibling::text()
>
> Using following::text() seems to collect also the notes in italic,
> but I don't know if it's too difficult to support it for bsoup.
>
BeautifulSoup has some functionality that might be usable for this, but
I have to check.
I started to see xpath expressions in my dreams and something occurred
to me last night :-) The path I suggested for the extractor is not good
enough:
//b/a[starts-with(@href, '/name/nm')]/..
This assumes that every character has a page. In the "Matrix" quotes all
characters have pages, so it seemed to work, but on this page, for
example, there is a "Policeman" character who does not have one:
http://akas.imdb.com/title/tt0295701/quotes
The solution that comes to my mind is to use your preprocess_string
feature and insert 'div' tags into the HTML code:
Put a
<div class="_imdbpy">
in front of every
<a name="qtXXX">
and a
</div>
after (or before) every
<hr width="30%">
After that, the extractor path should be easier to write.
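
To make the idea concrete, here is a rough sketch of the substitution step,
using only the standard "re" module. The function name wrap_quotes and the
sample markup are hypothetical; how this would be hooked into
preprocess_string is not shown, and the pattern assumes the anchors and
rules appear exactly as written above.

```python
import re

# Assumed markup, as described in this message: each quote starts at an
# <a name="qtXXX"> anchor and is followed by an <hr width="30%"> rule.
_QUOTE_START = re.compile(r'(<a name="qt[^"]*">)', re.IGNORECASE)
_QUOTE_END = re.compile(r'(<hr width="30%">)', re.IGNORECASE)


def wrap_quotes(html):
    """Wrap each quote block in <div class="_imdbpy"> ... </div>.

    Hypothetical helper: opens a div in front of every quote anchor and
    closes it just before the following horizontal rule.
    """
    html = _QUOTE_START.sub(r'<div class="_imdbpy">\1', html)
    html = _QUOTE_END.sub(r'</div>\1', html)
    return html


# Made-up sample: the second speaker has no link, like the "Policeman".
sample = (
    '<a name="qt0001"></a><b><a href="/name/nm0000001/">Hero</a></b>: Hi.<br>'
    '<hr width="30%">'
    '<a name="qt0002"></a><b>Policeman</b>: Freeze!<br>'
    '<hr width="30%">'
)
wrapped = wrap_quotes(sample)
```

With the quotes wrapped like this, the extractor could select the divs by
their class instead of relying on the character link. One caveat: if the
last quote on a page is not followed by an <hr width="30%">, its div would
stay unclosed, so that case would need separate handling.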
Do you want to write the parser yourself or would you like me to have a
go at it?
Turgut
_______________________________________________
Imdbpy-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel