IMDb seems to have changed their pages slightly, causing movieParser.py to
include trailing junk characters in the "plot summary".

For instance, with "Nothing Like the Holidays"
(http://www.imdb.com/title/tt1151915/), the "plot summary" ends up being:

<begin>
A Puerto Rican family living in the area of Humboldt Park in west Chicago
face what may be their last Christmas together. |  »
<end>

I tracked this down to the following code, which only strips a trailing |
character:

                Extractor(label='h5sections',
                        path="//d...@class='info']/h5/..",
                        attrs=[
                            Attribute(key="plot summary",
                                path="./h5[starts-with(text(), " \
                                        "'Plot:')]/../div/text()",
                                postprocess=lambda x: \
                                        x.strip().rstrip('|').rstrip()),

Changing the postprocess to the following fixes the problem by looking for
the "RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK" in addition to the | :

x.strip().rstrip(u'| \u00BB').rstrip()),
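
To illustrate the difference, here is a minimal standalone sketch. It is not
part of movieParser.py: the function names old_postprocess and
new_postprocess are made up for this example, and the sample string assumes
the junk consists of plain spaces around the | followed by U+00BB.

```python
# -*- coding: utf-8 -*-
# Standalone comparison of the two postprocess lambdas quoted above.
# The function names are hypothetical; only the string operations come
# from the code in this message.


def old_postprocess(x):
    # Current behaviour: only a trailing '|' is stripped.
    return x.strip().rstrip('|').rstrip()


def new_postprocess(x):
    # Proposed behaviour: strip any trailing run of '|', space, or
    # U+00BB (RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK).
    return x.strip().rstrip(u'| \u00BB').rstrip()


# Sample input, assuming plain spaces between the text, '|' and U+00BB.
raw = (u"A Puerto Rican family living in the area of Humboldt Park in "
       u"west Chicago face what may be their last Christmas together. "
       u"|  \u00BB")

old = old_postprocess(raw)   # still ends with the junk characters
new = new_postprocess(raw)   # ends with "together."
```

Note that rstrip() with a string argument treats it as a set of characters,
so a summary that legitimately ended in | or U+00BB would lose that
character as well.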