On Tue, Feb 7, 2012 at 09:20, Davide Alberani <davide.alber...@gmail.com> wrote:
>
> As usual, I'm really busy right now... I hope to have time to give it
> a look this weekend.

Hey, snowstorms buy you a lot of free time... :-P

It was easier than I thought, mostly thanks to the fact that we already
have MD5 checksums of names and titles (a more or less recent feature).

In the Mercurial repository there's a draft solution.

How it works:
- titles/names with an imdbID are stored in a dbm database, using their
  md5 as keys.
- at restore time, imdbIDs are restored in batches of 10,000 at a time.

Notes:
- by default, the databases are created in the current directory (and
  not deleted); there's now the '-t dir' command line argument, to
  specify a temporary directory.
- I've not tested it with huge amounts of data: if it's slow or fails,
  let me know whether it happens while storing or restoring the IDs
  (and send the error message).
- 10,000 entries per batch is *totally* arbitrary: we have to choose a
  good compromise between performance and the maximum size of a query.
- the batch is executed as a single query, like:
    UPDATE table SET imdb_id = CASE md5sum
      WHEN 'md5_1' THEN 'imdbID1'
      ...
      END
    WHERE md5sum IN ('md5_1', 'md5_2', ...)
  I don't really know if this syntax is valid for every SQL database...
- I've simplified the code, maybe too much.
- I've not tested it with CSV support.

As usual, any test, bug report, comment and so on is welcome.

-- 
Davide Alberani <davide.alber...@gmail.com> [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
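
For reference, the batched restore described in the message above could be sketched roughly as follows. This is only an illustration under assumptions, not the code in the repository: the `title` table, the helper names (`md5_key`, `batches`, `restore_imdb_ids`) and the sample IDs are made up, a plain dict stands in for the dbm database, and sqlite3 is used just so the example runs on its own. It also binds values as parameters instead of quoting them into the query; each entry then needs three placeholders, so the driver's limit on bound parameters is one more constraint on the batch size.

```python
import hashlib
import sqlite3


def md5_key(text):
    """Return the hex MD5 digest used as the lookup key for a title or name."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def restore_imdb_ids(conn, table, saved, batch_size=10000):
    """Write saved md5 -> imdbID pairs back, issuing one UPDATE per batch.

    `table` is interpolated into the SQL, so it must be a trusted name.
    """
    pairs = sorted(saved.items())
    for batch in batches(pairs, batch_size):
        whens = " ".join("WHEN ? THEN ?" for _ in batch)
        marks = ", ".join("?" for _ in batch)
        sql = (f"UPDATE {table} SET imdb_id = CASE md5sum {whens} END "
               f"WHERE md5sum IN ({marks})")
        # Parameters follow the placeholder order: (md5, imdbID) pairs for
        # the CASE branches first, then the md5 values for the IN list.
        params = [value for pair in batch for value in pair]
        params += [md5 for md5, _ in batch]
        conn.execute(sql, params)
    conn.commit()


# Hypothetical demo data: four titles, three of which had an imdbID saved.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE title (title TEXT, md5sum TEXT, imdb_id TEXT)")
titles = ["Title A (2001)", "Title B (2002)", "Title C (2003)", "Title D (2004)"]
conn.executemany("INSERT INTO title (title, md5sum) VALUES (?, ?)",
                 [(t, md5_key(t)) for t in titles])

saved = {
    md5_key("Title A (2001)"): "0000001",
    md5_key("Title C (2003)"): "0000003",
    md5_key("Title D (2004)"): "0000004",
}
restore_imdb_ids(conn, "title", saved, batch_size=2)  # runs two UPDATEs
rows = dict(conn.execute("SELECT title, imdb_id FROM title"))
```

Rows whose md5 is not in the saved set (here "Title B (2002)") are left untouched, because the WHERE clause restricts each UPDATE to the md5 values in the current batch. Without that clause the CASE expression would set every unmatched row to NULL.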