Hello Gary,

ghmgc> A fundamental problem with the approach a la:

ghmgc> for row in view:
ghmgc>     vw2.append((row.a, row.b))

ghmgc> is that it is *extremely* inefficient.  It
ghmgc> hurts me to even look at it, thinking of all the accesses, assignments,
ghmgc> copies, function call overhead, etc. that are going on underneath things
ghmgc> (added to which is the overhead induced by the fact that it is a Python
ghmgc> loop instead of a C loop).


That is certainly the case. Below are the timings of three naive
programs I wrote to test MetaKit, BSDDB and sqlite. Each program reads
the same file containing 1.2 million English words and puts 3 copies of
each word into a record of the form 'word1, word2, word3,
integer_index' in its db. MetaKit comes out worst in this scenario,
I'm sad to say. All the programs are written in this manner:

create list_of_records
for row in list_of_records:
    write_to_db(row)
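
To make that shape concrete, here is a minimal sketch of such a
row-at-a-time write loop with periodic timing output. It uses Python's
standard sqlite3 module and an in-memory database. The table name
`words`, the helpers `make_records` and `write_rows`, and the tiny word
list are made up for illustration and are not the actual test programs.

```python
import sqlite3
import time

def make_records(words):
    # One record per word: three copies of the word plus an integer index.
    return [(w, w, w, i) for i, w in enumerate(words)]

def write_rows(conn, records, report_every=100000):
    # Hypothetical helper: insert one row per iteration and print the
    # elapsed time every `report_every` rows.
    start = last = time.time()
    cur = conn.cursor()
    for i, row in enumerate(records):
        cur.execute("INSERT INTO words VALUES (?, ?, ?, ?)", row)
        if i % report_every == 0:
            now = time.time()
            print("%d, time: %.2f, delta: %.2f" % (i, now - start, now - last))
            last = now
    conn.commit()
    return time.time() - start

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (w1 TEXT, w2 TEXT, w3 TEXT, idx INTEGER)")
records = make_records(["alpha", "beta", "gamma"])  # stand-in for the word file
elapsed = write_rows(conn, records)
count = conn.execute("SELECT COUNT(*) FROM words").fetchone()[0]
```

The loop body runs once per record, which is exactly the per-row
overhead in Python that Gary objects to.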

Well, I know this is sort of comparing apples to oranges, since in
bsddb I just put one string into the db, while in MetaKit those are 4
separate properties that I can do something with. Hence I am not giving
up on MetaKit: I need fancier data manipulation than just reading a
string from the db, and if there's no key for the field/property I
need, then I have to read the whole damn db.

One could argue that I should exploit the column-wise data
organization of MetaKit, but frankly, I don't have the faintest idea
how to do that. It is probably non-trivial anyway, given that e.g.
file object iterators in Python work in a record/line-oriented
manner. (I didn't use them in this case: all data was in memory
to prevent disk access thrashing between the text file and the db, and
thus to isolate the performance of the db as much as possible.) So
there is a sort of mismatch between the organization of the usual data
sources and the data organization of the storage.
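
To illustrate the mismatch: turning row-oriented records into
column-oriented lists is easy enough in plain Python, as in the sketch
below. The helper name `rows_to_columns` is made up, and whether
MetaKit can load such columns in bulk is something I have not checked,
so this shows only the reshaping step, not a MetaKit call.

```python
def rows_to_columns(rows):
    # zip(*rows) groups the i-th field of every row into one tuple,
    # i.e. it transposes a list of records into a list of columns.
    return [list(col) for col in zip(*rows)]

rows = [
    ("apple", "apple", "apple", 0),
    ("pear", "pear", "pear", 1),
]
columns = rows_to_columns(rows)
```

Note that this needs all rows in memory at once, which a line-by-line
file iterator does not give you for free.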


# this is sqlite version
>>> import mlite
start...
0, time: 18.25, delta: 18.25
100000, time: 26.77, delta: 8.52
200000, time: 35.17, delta: 8.41
300000, time: 43.69, delta: 8.52
400000, time: 52.24, delta: 8.55
500000, time: 60.94, delta: 8.70
600000, time: 69.69, delta: 8.75
700000, time: 78.45, delta: 8.77
800000, time: 87.19, delta: 8.73
900000, time: 96.00, delta: 8.81
1000000, time: 104.75, delta: 8.75
1100000, time: 113.55, delta: 8.80
1200000, time: 122.38, delta: 8.83
after syncing: 126.28
end.

# this is MetaKit version
>>> reload(mk3)
start...
0, time: 22.42, delta: 22.42
100000, time: 40.03, delta: 17.61
200000, time: 56.34, delta: 16.31
300000, time: 74.28, delta: 17.94
400000, time: 94.05, delta: 19.77
500000, time: 115.20, delta: 21.16
600000, time: 138.20, delta: 23.00
700000, time: 163.02, delta: 24.81
800000, time: 189.77, delta: 26.75
900000, time: 218.09, delta: 28.33
1000000, time: 248.31, delta: 30.22
1100000, time: 280.19, delta: 31.88
1200000, time: 314.25, delta: 34.06
Values written, now syncing, time: 329.50
After syncing: 356.61
end.

# this is BSDDB version
>>> import mbs3
start
0, time: 16.23, delta: 16.23
100000, time: 18.36, delta: 2.13
200000, time: 20.59, delta: 2.23
300000, time: 22.86, delta: 2.27
400000, time: 25.11, delta: 2.25
500000, time: 27.91, delta: 2.80
600000, time: 30.27, delta: 2.36
700000, time: 32.48, delta: 2.22
800000, time: 34.73, delta: 2.25
900000, time: 36.97, delta: 2.23
1000000, time: 39.27, delta: 2.30
1100000, time: 41.55, delta: 2.28
1200000, time: 43.97, delta: 2.42
Values written, now syncing, time: 45.55
After syncing: 45.70
end
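
For a rough comparison, the logs above can be reduced to write
throughput. The sketch below assumes that the time printed at index 0
is setup cost and that 1,200,000 records are written between the first
and the last progress line; the helper name `records_per_second` is
made up for illustration.

```python
N_RECORDS = 1200000

# (time at index 0, time at index 1200000), in seconds, from the logs above.
timings = {
    "sqlite": (18.25, 122.38),
    "MetaKit": (22.42, 314.25),
    "BSDDB": (16.23, 43.97),
}

def records_per_second(t_first, t_last, n=N_RECORDS):
    # Average write rate over the interval between the two progress lines.
    return n / (t_last - t_first)

rates = {name: records_per_second(*t) for name, t in timings.items()}
for name in sorted(rates, key=rates.get, reverse=True):
    print("%-8s %8.0f records/s" % (name, rates[name]))
```

By this measure MetaKit's write phase takes roughly 2.8 times as long
as sqlite's and about 10.5 times as long as BSDDB's. Its per-100000
delta also grows from 17.61 to 34.06 seconds over the run, while the
other two stay roughly flat.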

-- 
Best regards,
 Marcin                            mailto:[EMAIL PROTECTED]

_____________________________________________
Metakit mailing list  -  [email protected]
http://www.equi4.com/mailman/listinfo/metakit
