Re: [Elisa] IMDb plugin

Benjamin Kampmann Mon, 27 Oct 2008 10:38:05 -0700

Hi.

Thanks for that plugin. As I can see you've put a lot of energy in it,
writing
your own imdb parser :) .


As I discussed with you this we want to have this as one of the first
plugins
going to the new plugin release procedure we are working on. So we will not
merge it now but keep this merge request as long as the plugin made it
through
the new procedure though.

But as you aked my I still did review and this is the result of it.

bb:comment

I have some general remarks.
 - you are not using BeautifulSoup efficiently at some point. You are
getting
   all the links and loop through them to find out the ones with a certain
   attribute set to something. That can be done with BS directly see [1]:
       soup.findAll("b", { "class" : "lime" })
 - at some point in the parsers you are catching all Exceptions. That is a
   possible bug-hidding-thing and is very unsexy for debugging ;) . So
please
   catch only the Exceptions you are certain about (like TypeError, KeyError
or
   similar).
 - as you told me you wanted to have the parsing methods on class level.
Well
   they are not atm ;) .


[1]
http://www.crummy.com/software/BeautifulSoup/documentation.html#The%20basic%20find%20method:%20findAll(name,%20attrs,%20recursive,%20text,%20limit,%20**kwargs)


>=== added file 'elisa-plugins/elisa/plugins/imdb/models.py'
>--- elisa-plugins/elisa/plugins/imdb/models.py    1970-01-01 00:00:00 +0000
>+++ elisa-plugins/elisa/plugins/imdb/models.py    2008-10-05 22:57:43 +0000
>@@ -0,0 +1,70 @@
 [...]
>>+class MovieLightModel(Model):
 [...]
>>+class SearchListModel(Model):
 [...]
>>+class MovieModel():
 [...]
>>+class PersonModel():

I have to remarks about this. The first one is only a question: why are
MovieModel and PersonModel not inheriting from the Base-Model-Class?
(Even though I have a good reason in the back of my mind. But that requires
a
more general discussion and should not take place here.)

My second remark is something we agreed on inside the team. That is to *not*
put
the models attributes on class level. At some point we wanted to have an
automatic system for creating them at __init__ of the base class, but we
didn't
implement that so far. So for now we decided that we want to have them in
the
__init__ of the particular Model.

This sounds like more work in the first place but let me explain why. Over
time
we had more and more problems debugging the models because of that.
Basically
what happended was the following example code:

    for i in xrange(4):
        model = MovieModel()
        model.cast.append('Micheal Moore')
        model_list.append(model)

As this code is pretty common and the rule for lists of a Model says that it
has to be set to an empty list at creation time that is fine. But with the
initialization of the list on class level we append "Michael Moore" to the
list
on class level. So if the next person now grabs one of the models and tries
to
read the cast "Michael Moore" appears four times:

    print model_list[0].cast
    >> ['Michael Moore', 'Michael Moore', 'Michael Moore', 'Michael Moore']

So. We decided for Models to do the following: everything has to be set at
__init__ (don't forget to call super before!). Lists may be set to an empty
list, everything else may be set to None. I know that is not yet written
down
anywhere so remind me to do that at some point ;) . So for your MovieModel
it
would look like this:
    class MovieModel(Model):
    """
    Represent a movie and hold the metadata informations from Imdb
    """
    def __init__(self):
        super(MovieModel, self).__init__()
        self.imdb_id = None
        self.name = None
        self.mpaa_rating = None
        self.user_rating = None
        self.votes = None
        self.director = [] #as it is a list maybe rename it to direcotrs?
        self.release_date = None # here it would also make sense to say
                       # what kind of date: timestamp?
                     # datetime.datetime object?
        self.genres = []
        self.tagline = None
        self.plot = None
        self.runtime = None
        self.cast = []
        self.countries = []
        self.language = None
        self.certificates = []
        self.poster_href = None
        self.poster_thumbnail = None

One other thing I just saw. Usually we name the Models according the place
they
come from. In this case ImdbMovieModel would make sense. But I am not sure
if
that was a common sense thing or just because we have AlbumModel that we
inherit
from so the YesfmAlbumModel should have a different name. Anyway, I would
not
give a bad review about that atm as we don't have any base MovieModel and
need
to create it later. Then we can still change the names :) .


>=== added file 'elisa-plugins/elisa/plugins/imdb/resource_provider.py'
>--- elisa-plugins/elisa/plugins/imdb/resource_provider.py    1970-01-01
00:00:00 +0000
>+++ elisa-plugins/elisa/plugins/imdb/resource_provider.py    2008-10-05
22:57:43 +0000
>@@ -0,0 +1,412 @@
>
>+IMDB_SERVER = 'www.imdb.com'
>+
>+
>+class ImdbResourceProvider(ResourceProvider):
    [...]
>+
>+    search_uri = "http://"; + IMDB_SERVER + "/find?.*"
>+    search_re = re.compile(search_uri)
>+    title_uri = "http://"; + IMDB_SERVER + "/title/tt.*"
>+    title_re = re.compile(title_uri)
>+
>+    supported_uri = search_uri + '|' + title_uri
I think that following regexp would be better here. But that is a personal
thing (and it is untested ;) ):
    supported_uri = "$http://%s/(find?.*|title/tt.*)" % IMDB_SERVER

>+    def initialize(self):
>+        self.server_uri = MediaUri(IMDB_SERVER)
>+        self.client = ElisaAdvancedHttpClient(host=self.server_uri.host,
>+                                              port=self.server_uri.port or
80)
That looks like overhead. just do self.client = Elisa...(IMDB_SERVER, 80)

>+        return defer.succeed(self)
That is a common mistake: when you inherit from a resource provider it is
always
better to call the base classes in initialize:
    return super(ImdbResourceProvider, self).initialize()

We should really write a document about these things ....


>+    def get(self, uri, context_model=None):
>+        """
>+        GET request to the IMDb.com servers.
>+
>+        It accepts the following types of URLs:
>+
>+          - http://www.imdb.com/find?s=all&q=.* : query to the IMDb.com
web site
>+            which, returns a model that contains a list of MovieLightModel
>+            instances
>+            (L{elisa.plugins.imdb.models.SearchListModel}).
>+
>+          - http://www.imdb.com/title/tt.* : query to the IMDb.com web
site
>+            which returns a MovieModel for the title specified by the IMDb
>+            identifier (for example "tt0123456").  If the movie isn't
found
>+            then an empty object is returned with imdb_id set to None.
>+            (L{elisa.plugins.imdb.models.MovieModel}).
>+
>+        """
That is a very good documentation! Thanks.


+        url = str(uri)
+
+        # Select the correct HTTP client to target
+        if self.search_re.match(url) is not None:
+            result_model = SearchListModel()
+        elif self.title_re.match(url) is not None:
+            result_model = MovieModel()
+        else:
+            # Shouldn't get here if framework honors our support_uri
variable.
+            return (None, defer.fail(NoMatchingResourceProvider('GET: ' +
url)))
As this is not inside the ResourceManager you should not raise the
NoMatchingResourceProvider. It would lead to the problem that the debugger
sees
that message and says that the user is simply missing the imdb resource
provider or even worse someone is catching it and assumes that the provider
is
not loaded because he got that message.

Anyway, as you specified the supported_uri that should not happen in
real-elisa
at any point. So raising a NotImplemented is better as it also means that
there
is a bug in the ResourceManager or your supported_uri (because your provider
was
triggered for a not supported uri).

[...]
>+def _find_movie(imdb_id, movie_list):

I would change the name of that method because _find_movie is not clear.
Maybe
is_movie_already_in_list would be more precise.

>+def _parse_title_page(soup, title):
    [...]
>+    list = []
'list' is actually a reserved python word. I would rename it to result_list.


    [...]
>+        if link.has_key('href'):
>+            # Get the number of user votes.
>+            if link['href'] == 'ratings' and
votes_pattern.match(link.string):
>+                votes = link.string.replace(' votes', '')
>+                movie.votes = votes
>+            # Get the imdb id.
>+            if id_pattern.match(link['href']):
>+                id_str = link['href'].replace('/title/', '')
>+                id_str = id_str.replace('/recommendations', '')
>+                movie.imdb_id = id_str
>+            # Get movie poster
>+            if link.has_key('name') and link['name'] == 'poster':
>+                if link.img != None and link.img.has_key('src'):
>+                    movie.poster_thumbnail = link.img['src']
>+                    movie.poster_href = "http://"; + IMDB_SERVER +
link['href']

Is it possible that a link is containg all three of these informations? If
not
it might be usefull to change it to a elif or do a continue inside the if so
that the rest of the loop is skipped.


In the tests:

>+import binascii
>+import hashlib
useless imports.

>+class TestImdbResourceProvider(TestCase):
    [...]
>+    def test_search_movie_1(self):
    [...]
>+    def test_get_movie_1(self):
I know we have these kind of tests in our plugins, too but that is bad
practice.
They are actually testing a functionality and could fail because of a lot of
reasons (missing network, imdbservers down, server overloaded, page missing
etc). Because of that we introduced a Function-Test-Switch some weeks ago.
It
was maybe not existing when you wrote the code but please add it here now
:).
Its name is run_functional_tests_check in elisa-core/core/utils/misc.py.

And as you mentioned yourself already I would really like to see unit tests
about the parsing itself.


Btw. since a recent change in elisa you have to add an entry point in your
setup.py to make the resource provider be loaded at startup. But that is
something seperate.

Thanks and we will see this plugin soon in our new plugin system then :)
Benjamin

Re: [Elisa] IMDb plugin

Reply via email to