Hi all, some partially bad news from IMDb (thanks to Ori for pointing it out to me).
Table of contents:
* bad news
* good news
* temporary fix
* how to download the datasets
* what's next?

[bad news]

The http://www.imdb.com/interfaces page states that from now on the plain
text data files are released as a set of files in an S3 bucket named
'imdb-datasets'. A few things are not so nice:

1. users are required to pay according to the data transfer pricing
explained here: https://aws.amazon.com/s3/pricing/
I still haven't downloaded all the data, so I don't know how big it is and
can't tell how much you'll spend, but I guess no more than a few cents.

2. the format of the data is completely different from the old one, so it
can't be used with imdbpy2sql.py, sorry.
(that's the moment where we hate them, if you are wondering ;))

3. from their description, I fear that a lot of information is missing: no
trivia, biographies, certificates, color info, crazy-credits, country (!),
goofs, keywords, plot, MPAA ratings, movie links, companies, quotes, sound
mix, soundtracks, taglines, technical details and maybe more.

[good news]

In this land of sorrow, there's also some good news: titles and persons are
now identified by their real imdbIDs (nm0000001 / tt0000001), so you can
link a web page to an entry in this dataset. Plus, the dataset is updated
daily and seems much more db-friendly to parse (see the reading sketch at
the end of this mail).

[temporary fix]

It seems that, for the moment, the old FTP mirrors are still being updated.
I'm quite sure they will be shut down soon, but for now you can still
download the data in the old format from:
ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/

[how to download the datasets]

To access the data you have to create an Amazon AWS account; then, under
"My Security Credentials", go to IAM Users and add a new user. Also create
a separate, new group for that user and grant it the "AmazonS3FullAccess"
policy (I'm pretty sure this can be limited to more granular permissions,
but for the moment we don't care). When the user is created, an Access Key
is generated: store the Access Key ID and the Secret Access Key.

With those keys you can download the objects in the 'imdb-datasets' bucket.
There are various ways to do it; for example you can use s3cmd. Install it
and configure it (one time only) with:

  s3cmd --configure

Then, to download a file (all on one line):

  s3cmd --requester-pays --continue get s3://imdb-datasets/documents/v1/current/title.basics.tsv.gz

The available objects are: title.basics.tsv.gz, title.crew.tsv.gz,
title.episode.tsv.gz, title.principals.tsv.gz, title.ratings.tsv.gz,
name.basics.tsv.gz
(a boto3 alternative is sketched at the end of this mail)

[what's next?]

Uhhh... who knows. I have to take a better look at the data format. We can
introduce a new script to import the new datasets. If we use a completely
new db schema, I'm sure we can import the data very quickly (see the
quick-and-dirty import sketch at the end of this mail), but that would mean
a new 'parser' module must be written to read it. Probably not too complex,
but more code to write. If we stay with the current db schema, the importer
will be more complex, but we can still use the 'sql' parser. Not sure...
the first road means a fresh start, which is sometimes what's needed to
improve. ;-)

What do you think? Any opinions? Anyone willing to help with the new code?
IMDbPY summer of code is open: we pay in (little) exposure!
http://theoatmeal.com/comics/exposure ;-)
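
[sketch: downloading with boto3]

For those who prefer Python over s3cmd, something along these lines should
also work with boto3 (untested sketch; the bucket name and the
documents/v1/current/ prefix come from the s3cmd example above, everything
else is plain boto3):

  import boto3

  # credentials of the IAM user created above
  s3 = boto3.client('s3',
                    aws_access_key_id='YOUR_ACCESS_KEY_ID',
                    aws_secret_access_key='YOUR_SECRET_ACCESS_KEY')

  objects = ['title.basics.tsv.gz', 'title.crew.tsv.gz',
             'title.episode.tsv.gz', 'title.principals.tsv.gz',
             'title.ratings.tsv.gz', 'name.basics.tsv.gz']

  for name in objects:
      # RequestPayer='requester' is the boto3 equivalent of --requester-pays
      s3.download_file('imdb-datasets', 'documents/v1/current/' + name, name,
                       ExtraArgs={'RequestPayer': 'requester'})
      print('downloaded', name)

Remember that, it being a requester-pays bucket, the transfer costs end up
on your own AWS bill, exactly as with s3cmd.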
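[sketch: reading a dataset]

A rough idea of how db-friendly the new format looks. This assumes a
gzipped, tab-separated file with a header row; I haven't verified how
missing values are represented (the '\N' handling below is a guess), and
'tconst' is only my guess for the name of the column holding the ttNNNNNNN
identifier:

  import csv
  import gzip

  with gzip.open('title.basics.tsv.gz', 'rt', encoding='utf-8') as fd:
      reader = csv.DictReader(fd, delimiter='\t', quoting=csv.QUOTE_NONE)
      for count, row in enumerate(reader):
          # turn the (presumed) \N markers into None
          row = {k: (None if v == r'\N' else v) for k, v in row.items()}
          print(row.get('tconst'), row)
          if count >= 4:
              break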
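[sketch: a quick-and-dirty import]

Just to show what I mean by "import the data very quickly" with a brand new
schema: a trivial loader that creates one table per dataset, using the
header row as column names. SQLite is used here only for the example; no
typing, no indexes, completely untested:

  import csv
  import gzip
  import sqlite3

  def load_tsv(db, fname, table):
      """Create a table mirroring the tsv header and bulk-insert the rows."""
      with gzip.open(fname, 'rt', encoding='utf-8') as fd:
          reader = csv.reader(fd, delimiter='\t', quoting=csv.QUOTE_NONE)
          header = next(reader)
          cols = ', '.join('"%s" TEXT' % c for c in header)
          db.execute('CREATE TABLE IF NOT EXISTS "%s" (%s)' % (table, cols))
          marks = ', '.join(['?'] * len(header))
          db.executemany('INSERT INTO "%s" VALUES (%s)' % (table, marks),
                         (row for row in reader if len(row) == len(header)))
      db.commit()

  db = sqlite3.connect('imdb-datasets.db')
  load_tsv(db, 'title.basics.tsv.gz', 'title_basics')

The real importer would obviously need proper types, indexes and a mapping
to whatever schema we decide on; this is just to give an idea of the amount
of code involved.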
-- 
Davide Alberani <davide.alber...@gmail.com>  [PGP KeyID: 0x3845A3D4AC9B61AD]
http://www.mimante.net/