Hello Wolfgang,
WL> i have the impression that although there is some documentation of
WL> metakit's column-oriented storage, there is very little knowledge about
WL> (1) why this should be a good idea,
The rationale is perfectly simple: frequently the user only needs to
access particular columns of a table, not all of them, so reading
the whole record is actually a waste. IIRC, Sybase once used the
same rationale for the column-oriented organization of one of their
data-mining DBMSes.
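To make that concrete, here is a toy illustration in plain Python (nothing MK-specific, just two layouts of the same data): with a row-oriented layout you touch every record to pull out one field, while a column-oriented layout hands you that field's values directly.

```python
# Toy comparison of row-oriented vs column-oriented layouts in plain
# Python (illustration only, not MetaKit's actual storage code).
rows = [('joe', 'doe', 44), ('john', 'lennon', 45), ('joe', 'smith', 42)]

# Row-oriented: extracting one field means visiting every record.
ages_from_rows = [r[2] for r in rows]

# Column-oriented: the same field is already stored contiguously.
columns = {'first': ['joe', 'john', 'joe'],
           'last':  ['doe', 'lennon', 'smith'],
           'age':   [44, 45, 42]}
ages_from_columns = columns['age']    # no per-record work at all

print(ages_from_rows)      # [44, 45, 42]
print(ages_from_columns)   # [44, 45, 42]
```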
WL> and (2) how to exploit this fact and
WL> in what use cases. some clarification of these two-and-a-half issues would
WL> be very welcome.
Most definitely - I was even feeling like I could write docs on this thing
myself and post them somewhere, but I have to get to know it first...
WL> however, the way things tend to be traditionally handled, this is either a
WL> design-time event that occurs not less often the more advanced the work in
WL> progress is. so any outstanding performance of a dbms to handle this
WL> particular situation is largely love's labour and lost for run-time.
Half of the problem is that record-oriented work is ingrained so
strongly in our minds. Changing that is the easy part; the hard part
is that the tools - e.g. DBF/CSV readers and file-access
functions/classes - are oriented towards rows, not columns. So either
some interface for dealing with that gets developed, or all those
smart optimisations are misused and subsequently go down the drain,
as you described.
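Python's own csv module (standard library) is a typical example: it yields one record at a time, so getting at a column takes an explicit transpose step - a minimal sketch:

```python
# csv is a row-oriented tool: the reader yields records, so a column
# view requires transposing the rows yourself.
import csv
import io

data = io.StringIO("word,idx\nlennon,45\nsmith,42\n")
reader = csv.reader(data)
header = next(reader)
rows = list(reader)                    # [['lennon', '45'], ['smith', '42']]

cols = dict(zip(header, zip(*rows)))   # transpose rows into columns
print(cols['word'])                    # ('lennon', 'smith')
print(cols['idx'])                     # ('45', '42')
```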
<snip>
WL> for these reasons i am much more interested in things working effectively
WL> with row-based access, no matter how they are implemented internally. let
WL> me remark that this particular point has made me look closer into sqlite
WL> as an alternative to metakit.
Well, yeah, but that's traditional SQL. In many applications (e.g.
natural language processing, which is what I'm interested in) that
just sucks. Or should I say: it's nearly unacceptable.
WL> while i am at it, i might further say that
WL> the other point is that i found that successive fetches of rows from a
WL> metakit table -- which, as i understand, is always entirely kept in
WL> memory, is outperformed by one or two orders of magnitude by an approach
WL> that first mirrors an entire table (with about 1k objects) into a python
WL> dictionary and then redirects fetches to this cache (meaning if you need
WL> more than 1% of the rows, you had better read *all* of them first).
I'm not sure what resides on disk and what resides in memory, but I'm
not really interested in that: the result of the mmapped file and the
MK magic is that, in theory, I get a structure that works 'as if' it
resided in memory while it may be considerably bigger than physical
memory (theoretically up to the maximum file size the OS allows). In
some applications that is VERY attractive.
I'm actually somewhat disappointed by the Python interface to MK. I
thought I would be able to access an MK view as if it were a list of
lists or a list of dictionaries, not via 'RowRef' objects with dot
access, like ViewName.PropertyName, iterating over the view, or doing
find(name='Joe'). I'd prefer something closer to the typical operation
of Python's lists/dicts. An ideal would IMHO be something like:
a = {'joe': ['doe', 44], 'john': ['lennon', 45], 'joe': ['smith', 42]}
...but implemented in such a way that a user could access any column
as a key - something like a multi-way dictionary (note that a plain
dict would collapse the duplicate 'joe' key, which is exactly why a
different structure is needed). The user could specify
a[(0, 'joe')]
and get a list of lists for all the joes in the 0th column (in
MK-speak, that's obviously a view, just interfaced this way), AND
simultaneously be able to specify a[(1, 'lennon')]. I'm sure
implementing something like this would be a lot of work, though.
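A minimal sketch of that multi-way lookup, in plain Python with no MetaKit underneath (the class name is made up): rows are lists, and `table[(col, value)]` returns every row whose col-th field equals the value.

```python
# Hypothetical multi-way table: index into any column by (position, value).
class MultiWayTable:
    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def __getitem__(self, key):
        # key is a (column_index, value) pair; return all matching rows.
        col, value = key
        return [r for r in self.rows if r[col] == value]

t = MultiWayTable([['joe', 'doe', 44],
                   ['john', 'lennon', 45],
                   ['joe', 'smith', 42]])

print(t[(0, 'joe')])     # both 'joe' rows
print(t[(1, 'lennon')])  # [['john', 'lennon', 45]]
```

A real implementation would of course keep per-column indexes instead of scanning, but the interface is the point here.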
Nevertheless, the mmapped principle of MK in many applications is
sound.
WL> let me finish this longish posting by stating that i believe that metakit
WL> is a great product, but i also believe that its python interface needs at
WL> the very least a sound overhaul (i think i could contribute parts of an
WL> interface i have been building during recent months).
I'd be grateful for _anything_ that eases my work with MK in Python --
the reasons I moved to Python from C are automatic memory management,
productivity, the standard library, and so on, so there's no point in
going back now.
WL> also, it is hard to
WL> see why i should subscribe to a database working in memory that is
WL> outperfomed by a standard dictionary, something that i could have achieved
WL> by merely pickling my data.
Will you fit a 2 GB dictionary into memory at once? And several of
them at the same time? OK, 64-bit systems are becoming affordable, but
that just rescales the problem upwards; the dynamic-memory /
filesystem size proportions will still hold.
Yes, that is all STILL disk access, and I'm aware of that - but either
I do the 'memory mapping' manually, in an extremely crude manner (e.g.
by using SQL, or by reading BSDDB records, unpickling, analyzing,
processing, re-pickling, etc.), or I let something like MK handle it
for me - much more efficiently and economically. There's no way I
could manually achieve comparable optimizations at such a high level,
and then there's the cost of doing all that by hand. Yep, persistence
is the future. I'm kind of surprised JCW did not take MK in that
direction.
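That crude manual cycle can be sketched with the stdlib shelve module (roughly: a dbm file plus pickling; the key and field names are made up) - every access pays the full store/fetch/pickle round trip:

```python
# The manual fetch / unpickle / process / re-pickle cycle, via shelve.
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'cache')

db = shelve.open(path)
db['row42'] = {'word': 'lennon', 'idx': 45}   # pickle + store
rec = db['row42']                             # fetch + unpickle
rec['idx'] += 1                               # "analyze / process"
db['row42'] = rec                             # re-pickle + store again
db.close()

db = shelve.open(path)                        # survives reopening
result = db['row42']['idx']
db.close()
print(result)   # 46
```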
I was even thinking of using memory-mapped files directly - they're
available in Python. But using something like MK is so much nicer.
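For reference, this is the raw facility I mean - Python's standard mmap module, where the mapped file slices like a bytes object and writes go straight back to disk (file name made up for the example):

```python
# Direct use of the stdlib mmap module: map a file, read and write it
# through the mapping as if it were an in-memory byte sequence.
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
with open(path, 'wb') as f:
    f.write(b'hello metakit')

with open(path, 'r+b') as f:
    mm = mmap.mmap(f.fileno(), 0)   # length 0 = map the whole file
    first = mm[:5]                  # slices like a bytes object
    mm[0:5] = b'HELLO'              # write-through to the file
    mm.close()

content = open(path, 'rb').read()
os.remove(path)
print(first, content)
```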
>>>>> import mbs3
WL> i am curious about this -- would you disclose your sources for us, marcin?
No problem, but those are really quick & dirty hacks - I have very
little experience with Python as well. Anyway, they're pasted below.
The data source was two concatenated copies of a 300-thousand-word
list, available here:
http://www.comp.lancs.ac.uk/computing/research/stemming/general/300twl.zip
Anyway, time to end this horribly long post.
### SQLite version - this uses a db that is already structured; the
### script below does not populate it, sorry. Schema created via sqlite3.exe:
### CREATE TABLE words (word1 text, word2 text, word3 text, idx integer);
import sqlite
import time

bwl = open('bwl.txt').readlines()

def mlite():
    print 'start...'
    cx = sqlite.connect('bwlsqlite.db')
    ct = cx.cursor()
    tstart = tlast = time.time()
    for index, w in enumerate(bwl):   # enumerate avoids the off-by-one of zip/range
        if index % 100000 == 0:
            dt, tlast = time.time() - tlast, time.time()
            print "%d, time: %.2f, delta: %.2f" % (index, time.time() - tstart, dt)
        w = w.strip()                 # readlines() keeps the trailing newline
        # let the driver quote the values rather than formatting them in
        ct.execute('INSERT INTO words VALUES (%s, %s, %s, %s)', (w, w, w, index))
    cx.commit()                       # actually flush the inserts
    print "after syncing: %.2f" % (time.time() - tstart)
    print 'end.'

mlite()
### MetaKit version
import metakit
import time

bwl = open('bwl.txt').readlines()

def mktest():
    print 'start...'
    db = metakit.storage('test.mk', 1)
    vw = db.getas('words[word1:S,word2:S,word3:S,idx:L]')
    tstart = tlast = time.time()
    for index, w in enumerate(bwl):   # enumerate avoids the off-by-one of zip/xrange
        if index % 100000 == 0:
            dt, tlast = time.time() - tlast, time.time()
            print "%d, time: %.2f, delta: %.2f" % (index, time.time() - tstart, dt)
        vw.append(word1=w, word2=w, word3=w, idx=index)
    print "Values written, now syncing, time: %.2f" % (time.time() - tstart)
    db.commit()
    print "After syncing: %.2f" % (time.time() - tstart)
    print 'end.'

mktest()
### BSDDB version
import bsddb
import time

bwl = open('bwl.txt').readlines()

def mbs():
    db = bsddb.btopen('bwl.bsddb', 'n')
    tstart = tlast = time.time()
    for index, w in enumerate(bwl):
        if index % 100000 == 0:
            dt, tlast = time.time() - tlast, time.time()
            print "%d, time: %.2f, delta: %.2f" % (index, time.time() - tstart, dt)
        w = w.strip()                 # readlines() keeps the trailing newline
        db['%s' % index] = '%s, %s, %s, %d' % (w, w, w, index)
    print "Values written, now syncing, time: %.2f" % (time.time() - tstart)
    db.sync()
    print "After syncing: %.2f" % (time.time() - tstart)

print 'start'
mbs()
print 'end'
### end
--
Best regards,
Marcin mailto:[EMAIL PROTECTED]
_____________________________________________
Metakit mailing list - [email protected]
http://www.equi4.com/mailman/listinfo/metakit