Hi all,
some partially bad news from IMDb (thanks to Ori for pointing it out to me).

Table of Contents:
* bad news
* good news
* temporary fix
* how to download the datasets
* what's next?

[bad news]

The http://www.imdb.com/interfaces page states that from now on the plain text
data files are released as a set of files in an S3 bucket named 'imdb-datasets'.

There are some things that are not nice:

* users are required to pay according to the data transfer pricing
  explained here: https://aws.amazon.com/s3/pricing/
  I still haven't downloaded all the data, so I don't know how big it is
  and can't tell how much you'll spend, but I guess no more than a few cents.

* the format of the data is completely different from the old one, so it
  can't be used with imdbpy2sql.py, sorry. (that's the moment when we hate
  them, if you are wondering ;))

* from their description, I fear that a lot of information is missing:
  no trivia, biographies, certificates, color info, crazy-credits,
  country (!), goofs, keywords, plots, MPAA ratings, movie links, companies,
  quotes, sound mix, soundtracks, taglines, technical details and maybe more.

[good news]

In this land of sorrow, there's also some good news: titles and persons
are now identified by their real imdbIDs (nm0000001 / tt0000001), so you can
link a web page to an entry in this dataset.

Plus, the dataset is updated daily and seems much more db-friendly to parse.
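
For example, since the identifiers match the ones used in IMDb URLs, mapping
an entry to its web page is a one-liner (a hypothetical helper, not part of
IMDbPY):

```python
# Hypothetical helper: map a dataset identifier to its IMDb web page.
# IMDb uses the "tt" prefix for titles and "nm" for persons.
def imdb_url(imdb_id):
    prefix_to_path = {"tt": "title", "nm": "name"}
    path = prefix_to_path[imdb_id[:2]]
    return "http://www.imdb.com/%s/%s/" % (path, imdb_id)

print(imdb_url("tt0000001"))  # http://www.imdb.com/title/tt0000001/
print(imdb_url("nm0000001"))  # http://www.imdb.com/name/nm0000001/
```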

[temporary fix]

It seems that for the moment the old FTP mirrors are still being updated.
I'm quite sure they will be shut down soon, but for now you can still
download the data in the old format from:

[how to download the datasets]

To access the data, you have to create an Amazon AWS account; then,
under "My Security Credentials", go to IAM Users and add a new user.
Also add a separate, new group for that user and grant it the
"AmazonS3FullAccess" policy.
(I'm pretty sure this can be limited to more granular permissions, but
for the moment we don't care.)

When the user is created, an Access Key will be generated: please store the
Access Key ID and the Secret Access Key.

Now, with those keys, you can download the objects in the
'imdb-datasets' bucket.

There are various ways.  For example, you can use s3cmd.
Install it and configure it (one time only) with: s3cmd --configure

Then, to download a file (all on one line, replacing <object> with one of
the file names below):
s3cmd --requester-pays --continue get s3://imdb-datasets/<object>
The available objects are: title.basics.tsv.gz, title.crew.tsv.gz,
title.episode.tsv.gz, title.principals.tsv.gz, title.ratings.tsv.gz,
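
The new files are gzipped TSV, which should be straightforward to parse.
A minimal Python sketch (the column names here are an assumption based on
the file names, so check the real headers):

```python
import csv
import gzip
import io

def parse_tsv(stream):
    """Parse a tab-separated text stream with a header row into dicts."""
    return list(csv.DictReader(stream, delimiter="\t"))

# Hypothetical sample mimicking what title.ratings.tsv.gz might contain;
# the real column names must be checked against the actual file.
sample = b"tconst\taverageRating\tnumVotes\ntt0000001\t5.8\t1350\n"
compressed = io.BytesIO(gzip.compress(sample))

# gzip.open also accepts a file name, e.g. "title.ratings.tsv.gz".
with gzip.open(compressed, "rt") as stream:
    rows = parse_tsv(stream)

print(rows[0]["tconst"])  # tt0000001
```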

[what's next?]

Uhhh... who knows.
I have to take a closer look at the data format.

We can introduce a new script to import the new datasets.
If we use a completely new db schema, I'm sure we can import the data
very quickly, but that would mean writing a new 'parser' module to read it.
Probably not too complex, but more code to write.
If we stick with the current db schema, the importer will be more complex,
but we can still use the 'sql' parser.
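
As a rough sketch of the first option (new schema plus new importer), under
the assumption that the files are header-prefixed TSV; the table and column
names below are invented for illustration, not a proposal:

```python
import csv
import io
import sqlite3

# Invented schema for illustration only; the real one is still to be designed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE title_ratings (tconst TEXT, rating REAL, votes INTEGER)"
)

# In the real importer, this would be a gzip stream of the downloaded file.
tsv = io.StringIO("tconst\taverageRating\tnumVotes\ntt0000001\t5.8\t1350\n")
rows = csv.DictReader(tsv, delimiter="\t")

# Bulk-insert the parsed rows, converting columns to their db types.
conn.executemany(
    "INSERT INTO title_ratings VALUES (?, ?, ?)",
    ((r["tconst"], float(r["averageRating"]), int(r["numVotes"])) for r in rows),
)

print(conn.execute("SELECT rating FROM title_ratings").fetchone()[0])  # 5.8
```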

Not sure... the first road means a fresh start, which sometimes is
needed to improve. ;-)

What do you think?
Any opinion?
Anyone willing to help with the new code?

IMDbPY summer of code is open: we pay in (little) exposure!
http://theoatmeal.com/comics/exposure ;-)

Davide Alberani <davide.alber...@gmail.com>  [PGP KeyID: 0x3845A3D4AC9B61AD]

Imdbpy-help mailing list
