On Tue, Feb 7, 2012 at 09:20, Davide Alberani <davide.alber...@gmail.com> wrote:
>
> As usual, I'm really busy right now... I hope to have time to give it
> a look this weekend.

Hey, snowstorms buy you a lot of free time... :-P

It was easier than I thought, mostly thanks to the fact that we already have
md5 checksums of names and titles (a more or less recent feature).

In the Mercurial repository there's a draft solution.

How it works:
- titles/names with imdbID are stored in a dbm database, using their
md5 as keys.
- at restore time, imdbIDs are restored in batches of 10000 entries.
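
A minimal sketch of the two steps above, not the code in the repository.
It assumes Python 3's standard dbm module; the helper names (md5_of,
store_ids, iter_batches) and the file name "titles" are hypothetical.

```python
import dbm
import hashlib
import os
import tempfile


def md5_of(text):
    """Hex md5 digest of a title or name string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def store_ids(db_path, pairs):
    """Save (text, imdbID) pairs in a dbm file, keyed by the md5 of the text."""
    with dbm.open(db_path, "c") as db:
        for text, imdb_id in pairs:
            # dbm stores bytes, so encode both key and value explicitly.
            db[md5_of(text).encode("ascii")] = imdb_id.encode("ascii")


def iter_batches(db_path, batch_size=10000):
    """Yield lists of (md5, imdbID) tuples, at most batch_size per list."""
    batch = []
    with dbm.open(db_path, "r") as db:
        for key in db.keys():
            batch.append((key.decode("ascii"), db[key].decode("ascii")))
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:
        # Last, possibly shorter, batch.
        yield batch


def demo():
    """Store three made-up entries and read them back in batches of two."""
    path = os.path.join(tempfile.mkdtemp(), "titles")
    store_ids(path, [("Title A", "0000001"),
                     ("Title B", "0000002"),
                     ("Title C", "0000003")])
    return [len(batch) for batch in iter_batches(path, batch_size=2)]
```

Each batch yielded by iter_batches would then be turned into one UPDATE
statement; the order of the entries depends on the dbm backend in use.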

Notes:
- by default, the databases are created in the current directory (and
not deleted);
  there's now the '-t dir' command line argument, to specify a
temporary directory.
- I've not tested it with huge amounts of data: if it's slow or fails,
let me know
  whether it happens while storing or restoring the IDs (and send the
error message).
- 10,000 entries per batch is *totally* arbitrary: we have to choose a
good compromise
  between performance and the maximum size of a query.
- the batch is executed as a single query, like:
      UPDATE table SET imdb_id = CASE md5sum WHEN 'md5_1' THEN 'imdbID1' ... END
      WHERE md5sum IN ('md5_1', 'md5_2', ...)
  I don't really know if this syntax is valid in every SQL database...
- I've simplified the code, maybe too much.
- I've not tested it with CSV support.
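
As an illustration of the single-query batch described in the notes, here
is a small sketch that builds the same kind of UPDATE ... CASE statement.
It is not the code in the repository: the function name
build_batch_update is hypothetical, and it uses "?" placeholders (the
sqlite3 parameter style) instead of inline string literals, which is a
choice made for this example. It only shows that the syntax is accepted
by SQLite; other databases would need their own check.

```python
import sqlite3


def build_batch_update(table, batch):
    """Return (sql, params) for one CASE-based UPDATE covering the batch.

    batch is a non-empty list of (md5sum, imdb_id) tuples. The table name
    is interpolated directly, so it must come from trusted code.
    """
    if not batch:
        raise ValueError("empty batch")
    whens = " ".join("WHEN ? THEN ?" for _ in batch)
    marks = ", ".join("?" for _ in batch)
    sql = ("UPDATE %s SET imdb_id = CASE md5sum %s END "
           "WHERE md5sum IN (%s)" % (table, whens, marks))
    # First the (md5, imdbID) pairs for the CASE, then the md5s for IN.
    params = [value for pair in batch for value in pair]
    params += [md5 for md5, _ in batch]
    return sql, params


def demo():
    """Apply one batch to an in-memory table and return the resulting rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE title (md5sum TEXT PRIMARY KEY, imdb_id TEXT)")
    conn.executemany("INSERT INTO title (md5sum) VALUES (?)",
                     [("md5_1",), ("md5_2",), ("md5_3",)])
    sql, params = build_batch_update(
        "title", [("md5_1", "imdbID1"), ("md5_2", "imdbID2")])
    conn.execute(sql, params)
    return conn.execute(
        "SELECT md5sum, imdb_id FROM title ORDER BY md5sum").fetchall()
```

With placeholders, each entry in the batch costs three bound parameters,
so the database's limit on parameters per statement is one more factor in
choosing the batch size.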

As usual, any test, bug report, comment and so on is welcome.


-- 
Davide Alberani <davide.alber...@gmail.com>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
