Dan Stromberg wrote: > I've been putting a little bit of time into a file indexing engine [...]
> So far, I've been taking the approach of using a single-table database > like gdbm or dbhash [...] and making each entry keyed by > a word, and under the word in the database is a null terminated list of > filenames (in http://dcs.nac.uci.edu/~strombrg/base255.html representation). > [...] > the program just isn't performing like I'd like it to. > > And I think it's because despite the caching and minimal representation > conversion, it's still just too slow converting linear lists to arrays > back and forth and back and forth. I think I follow. The straightforward method of building the list of files associated with a word is order n**2 time. On each new entry, you look up the entire string, append one file-id to it, and write the new version back. > So this leads me to wonder - is there a python database interface that > would allow me to define a -lot- of tables? Like, each word becomes a > table, and then the fields in that table are just the filenames that > contained that word. That way adding filenames to a word shouldn't bog > down much at all. Well, you could use simple files instead of fancy database tables. Below is a demo of an alternate technique that uses bsddb B-Trees, and puts both the word and the file-id in the key. I don't know how efficient it is for real data, but at least the time won't grow as Theta(n**2). --Bryan import bsddb import urllib def add_words_from_file(index, fname, word_iterator): """ Pass the open-for-write bsddb B-Tree, a filename, and a list (or any interable) of the words in the file. """ fname = urllib.quote_plus(fname) s = set() for word in word_iterator: if word not in s: s.update(word) key = '%s %s' % (urllib.quote_plus(word), fname) index[key] = '' index.sync() def lookup(index, word): """ Pass the B-Tree (as built with add_words_from_file) and a word to look up. Returns list of files containing the word. """ word = urllib.quote_plus(word) fname_list = [] try: current = index.set_location(word) while True: (key, _) = current (w, fname) = key.split() if w != word: break fname_list.append(urllib.unquote_plus(fname)) current = index.next() except KeyError: pass return fname_list def test(): index = bsddb.btopen('junktest.bdb', 'n') data =[ ('bryfile.txt', 'nor heed the rumble of a distant drum'), ('junkfile.txt', 'this is the beast, the beast so sly'), ('word file.txt', 'is this the way it always is here in Baltimore') ] for (fname, text) in data: words = text.split() add_words_from_file(index, fname, words) for word in ['is', 'the', 'heed', 'this', 'way']: print '"%s" is in files: %s' % (word, lookup(index, word)) test() -- http://mail.python.org/mailman/listinfo/python-list