Here is a little workaround:

-- Extract imdb_id and md5sum (6 sec)
CREATE TABLE title_extract SELECT imdb_id, md5sum FROM title WHERE imdb_id IS NOT NULL;
CREATE TABLE name_extract SELECT imdb_id, md5sum FROM name WHERE imdb_id IS NOT NULL;

-- Add indexes (12 sec)
ALTER TABLE title_extract ADD INDEX md5sum_idx (md5sum);
ALTER TABLE name_extract ADD INDEX md5sum_idx (md5sum);


-- Reset imdb ids ...
UPDATE title SET imdb_id = NULL;
UPDATE name SET imdb_id = NULL;

-- Restore imdb ids for movies (2 min)
UPDATE title
INNER JOIN title_extract USING (md5sum)
SET title.imdb_id = title_extract.imdb_id;

-- Restore imdb ids for people (5 min)
UPDATE name
INNER JOIN name_extract USING (md5sum)
SET name.imdb_id = name_extract.imdb_id;



Total time for save/restore: less than 10 minutes
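
For anyone who wants to try the same save/restore idea without a MySQL server, here is a minimal sketch in Python. It is only an illustration under several assumptions: it uses the standard sqlite3 module as a stand-in for MySQL, SQLite has no UPDATE ... INNER JOIN, so the restore step is written as a correlated subquery instead, and the helper name save_and_restore and the sample rows are made up for the example.

```python
import sqlite3


def save_and_restore(conn, table):
    """Save imdb_id keyed by md5sum, reset the column, then restore it.

    Hypothetical helper: 'table' must be a trusted name ('title' or 'name'),
    since it is interpolated directly into the SQL text.
    """
    extract = table + "_extract"
    cur = conn.cursor()
    # Extract imdb_id and md5sum for every row that has an imdb_id.
    cur.execute("DROP TABLE IF EXISTS %s" % extract)
    cur.execute(
        "CREATE TABLE %s AS SELECT imdb_id, md5sum FROM %s "
        "WHERE imdb_id IS NOT NULL" % (extract, table))
    # Index the extract on md5sum so the restore lookup is fast.
    cur.execute(
        "CREATE INDEX %s_md5sum_idx ON %s (md5sum)" % (extract, extract))
    # Reset imdb ids (stands in for whatever wipes them, e.g. a re-import).
    cur.execute("UPDATE %s SET imdb_id = NULL" % table)
    # Restore: SQLite substitute for the MySQL UPDATE ... INNER JOIN.
    cur.execute(
        "UPDATE %s SET imdb_id = ("
        "SELECT e.imdb_id FROM %s AS e WHERE e.md5sum = %s.md5sum)"
        % (table, extract, table))
    conn.commit()
    cur.execute("SELECT COUNT(*) FROM %s WHERE imdb_id IS NOT NULL" % table)
    return cur.fetchone()[0]


# Tiny in-memory demo with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE title (id INTEGER PRIMARY KEY, imdb_id INTEGER, md5sum TEXT)")
conn.executemany(
    "INSERT INTO title (imdb_id, md5sum) VALUES (?, ?)",
    [(111, "aaa"), (None, "bbb"), (333, "ccc")])
restored = save_and_restore(conn, "title")
rows = conn.execute(
    "SELECT imdb_id, md5sum FROM title ORDER BY md5sum").fetchall()
```

Rows whose md5sum has no match in the extract table simply stay NULL, which is the same outcome as the INNER JOIN version above.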



On 12 Feb 2012, at 15:52, Emmanuel Tabard wrote:

> I was wondering, why don't you use the original DBs?
> 
> Something like this takes 3 seconds:
> 
> CREATE TABLE title_extract SELECT imdb_id, md5sum FROM title WHERE imdb_id IS NOT NULL;
> CREATE TABLE name_extract SELECT imdb_id, md5sum FROM name WHERE imdb_id IS NOT NULL;
> 
> And use your query to restore.
> 
> Should be freaking fast ...
> 
> On 12 Feb 2012, at 14:56, Davide Alberani wrote:
> 
>> On Sun, Feb 12, 2012 at 14:20, Emmanuel Tabard <m...@webitup.fr> wrote:
>>> 
>>> Fair enough!
>>> While it was selecting all the non-null ids, the memory of the process kept
>>> growing and the size of the .db file never grew.
>>> My theory is that dbm saves on close? Does that make sense?
>> 
>> Strange (although, anydbm being a generic interface to various underlying
>> modules, you can never tell).
>> 
>> This simple snippet, on my system, creates a 1.2 GB file, and in the process
>> memory is not used much (apart from caches, but that doesn't matter):
>> 
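>> #!/usr/bin/env python
>> # Python 2 script: anydbm, xrange and the print statement are Python 2 only.
>> import sys
>> import time
>> import anydbm
>> 
>> # 8 KB value stored under each key.
>> long_string = 'LALALALA' * 1024
>> # 'n' always creates a new, empty database.
>> db = anydbm.open('/tmp/big.db', 'n')
>> for x in xrange(100000):
>>     # dbm keys must be strings.
>>     db[str(x)] = long_string
>> 
>> print 'INSERT'
>> db.close()
>> print 'CLOSE'
>> # Pause so memory usage can be inspected before the process exits.
>> time.sleep(10)
>> print 'DONE'
>> sys.exit()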
>> #======================
>> 
>> I fear that the leak is in the loop over the result of the 'select'. :-/
>> 
>> 
>> -- 
>> Davide Alberani <davide.alber...@gmail.com>  [PGP KeyID: 0x465BFD47]
>> http://www.mimante.net/
> 


_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
