There are a couple of ways you can speed up your code here, which I
will get to in a moment. You make a point about the speed of mk
versus the speed of a raw Python dictionary, but you have to realize
that the cost of loading data into mk or sqlite only pays off if you
are going to perform relational operations on the data. If you are
not planning on relational operations, mk is probably not
appropriate.
What I mean is, if you just need a key->data mapping, then bsddb3 is
the right thing here. However, if you have multiple views that need
to be joined together, sorted, grouped, etc., then metakit/sqlite is
very appropriate.
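To make the distinction concrete, here is a rough sketch of the two
cases. It uses a plain dict and the standard-library sqlite3 module
rather than mk or bsddb3, and the table and column names (words,
langs, lang_id) are made up for illustration.

```python
import sqlite3

# Case 1: a plain key -> data mapping. A dict (or bsddb3 on disk)
# covers this without any relational machinery.
word_counts = {"apple": 3, "pear": 5}

# Case 2: two tables that must be joined and grouped. This is where a
# relational engine justifies its loading cost.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT, lang_id INTEGER)")
conn.execute("CREATE TABLE langs (lang_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO words VALUES (?, ?)",
                 [("apple", 1), ("pear", 1), ("pomme", 2)])
conn.executemany("INSERT INTO langs VALUES (?, ?)",
                 [(1, "english"), (2, "french")])
# Count the words per language name.
rows = conn.execute(
    "SELECT langs.name, COUNT(*) FROM words "
    "JOIN langs ON words.lang_id = langs.lang_id "
    "GROUP BY langs.name ORDER BY langs.name").fetchall()
conn.close()
```

If your access pattern only ever looks like case 1, the extra engine
buys you nothing.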
Here are some hints:
Commit early, commit often. You will notice that your timings get
slower and slower and then you take a big hit when the data is
committed. You should commit every 100,000 appends or so in your data
structure. Larger structures need more commits.
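The commit pattern on its own looks roughly like this. The sketch
uses sqlite3 as a stand-in so it runs without mk installed; the helper
name append_with_periodic_commit and the words table are hypothetical,
and the batch size is a parameter you would tune.

```python
import sqlite3

def append_with_periodic_commit(conn, rows, commit_every=100000):
    """Insert rows, committing every commit_every rows.

    Returns the number of commits issued, including the final one.
    """
    commits = 0
    for count, row in enumerate(rows, 1):
        conn.execute("INSERT INTO words VALUES (?, ?)", row)
        if count % commit_every == 0:
            conn.commit()
            commits += 1
    conn.commit()  # flush whatever is left after the last full batch
    return commits + 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT, idx INTEGER)")
sample = [("w%d" % i, i) for i in range(10)]
n_commits = append_with_periodic_commit(conn, sample, commit_every=3)
stored = conn.execute("SELECT COUNT(*) FROM words").fetchone()[0]
```

With mk the shape is the same: append in the loop and call
db.commit() on the same schedule.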
Blocked views are the best for large data sets. I'm going to skip a
lot of technical details here, but a blocked view is an intelligent
collection of other views. A single view in metakit can hold around
200,000 - 300,000 rows without performance hits; above this you should
use a blocked view. You can create a blocked view as follows:
view = storage.getas("test[_B[a:s,b:s,c:s]]").blocked()
Notice the _B which is the view that holds the subviews. This view
doesn't slow down with later appends and is very robust.
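If the idea of a "collection of other views" is hard to picture, the
toy class below may help. It is plain Python and is not how metakit
implements blocked views internally; the name BlockedList and the
fixed block_size are assumptions made purely for illustration.

```python
class BlockedList(object):
    """Toy container: a list of sub-lists, each capped at block_size rows."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.blocks = [[]]

    def append(self, row):
        # Start a new block once the last one is full, so no single
        # block ever grows past block_size.
        if len(self.blocks[-1]) >= self.block_size:
            self.blocks.append([])
        self.blocks[-1].append(row)

    def __len__(self):
        return sum(len(block) for block in self.blocks)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Every block except the last is full, so divmod locates the row.
        block, offset = divmod(index, self.block_size)
        return self.blocks[block][offset]

demo = BlockedList(block_size=3)
for i in range(10):
    demo.append(i)
```

Each append only ever touches the last small block, which is the
intuition behind appends not slowing down as the whole grows.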
Finally, when appending in python it is faster to append using tuples
vw.append(('1','2','3'))
than using
vw.append(a='1',b='2',c='3')
This is slightly dangerous: the values in the tuple must follow the
order in which the fields were defined in the getas statement.
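One way to guard against getting the order wrong is to build the
tuple from named values in a single place. The helper below is a
hypothetical sketch (FIELDS and make_row are not part of metakit), and
it may give back some of the speed gained by using tuples, so measure
before relying on it.

```python
# Must list the fields in the same order as the getas statement.
FIELDS = ("a", "b", "c")

def make_row(**values):
    """Return the values as a tuple in FIELDS order.

    Raises KeyError if a field is missing or an unknown name is given.
    """
    unknown = set(values) - set(FIELDS)
    if unknown:
        raise KeyError("unknown fields: %s" % sorted(unknown))
    return tuple(values[name] for name in FIELDS)

# Keyword order does not matter; the tuple always follows FIELDS.
row = make_row(c='3', a='1', b='2')
```

The result can then be passed straight to vw.append(row).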
I've modified your test as follows; let me know how it goes:
### MetaKit version
import os
import time

import metakit

with open('bwl.txt') as f:
    bwl = f.readlines()

def mktest():
    print('start...')
    # Start from a fresh file so earlier runs don't skew the timings.
    if os.path.exists('test.mk'):
        os.remove('test.mk')
    db = metakit.storage('test.mk', 1)
    # _B is the view that holds the subviews of the blocked view.
    vw = db.getas('words[_B[word1:S,word2:S,word3:S,idx:L]]').blocked()
    tstart = tlast = time.time()
    # enumerate covers every word; the zip/xrange pairing dropped the last one.
    for index, w in enumerate(bwl):
        # Commit every 100,000 appends and report progress.
        if index % 100000 == 0:
            now = time.time()
            dt, tlast = now - tlast, now
            print("%d, time: %.2f, delta: %.2f" % (index, now - tstart, dt))
            db.commit()
        # Tuple order must match the field order given to getas.
        vw.append((w, w, w, index))
    print("Values written, now syncing, time: %.2f" % (time.time() - tstart))
    db.commit()
    print("After syncing: %.2f" % (time.time() - tstart))
    print('end.')

mktest()