I'm no expert in BDBs, but I have spent a fair amount of time working with PostgreSQL and Oracle. It sounds like you need to optimize your algorithm and data representation.
I would do pretty much what you are doing, except I would only have the following relations:

- word to word ID
- filename to filename ID
- word ID to filename ID

You're going to want an index on pretty much every column in this database, because you're going to look up by any one of these columns for the corresponding value.

I said I wasn't an expert in BDBs, but I do have some experience building up large databases. In the first stage, you just accumulate the data; then you build the indexes only as you need them. Say you are scanning your files: you won't need an index on the filename-to-ID table, because you are just putting data in there. The word-to-ID table needs an index on the word, but not the ID (you're not looking up by ID yet), and the word ID-to-filename ID table doesn't need any indexes yet either. So build up the data without the indexes. Once your scan is complete, build the indexes you'll need for regular operation. After that, you can probably add data incrementally as you go.

As far as filename IDs and word IDs go, just use a counter to generate the next number. If you encode the number in base 255, you're really not going to save much space.

And your idea of hundreds of thousands of tables? Very bad. Don't do it.

--
http://mail.python.org/mailman/listinfo/python-list
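For what it's worth, here is a rough sketch of the accumulate-then-index scheme in Python, using the stdlib sqlite3 module as a stand-in for BDB (all table and column names are made up for illustration, not taken from your code):

```python
import sqlite3
import itertools

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Stage 1: plain tables for the three relations, no indexes yet --
# during the scan we are only inserting, never looking up by ID.
cur.execute("CREATE TABLE words (word_id INTEGER, word TEXT)")
cur.execute("CREATE TABLE files (file_id INTEGER, filename TEXT)")
cur.execute("CREATE TABLE word_file (word_id INTEGER, file_id INTEGER)")

# Plain counters generate the next ID -- no base-255 tricks needed.
next_word_id = itertools.count(1)
next_file_id = itertools.count(1)

word_ids, file_ids = {}, {}

def scan(filename, text):
    """Record every word of one file in the three relations."""
    if filename not in file_ids:
        file_ids[filename] = next(next_file_id)
        cur.execute("INSERT INTO files VALUES (?, ?)",
                    (file_ids[filename], filename))
    for word in text.split():
        if word not in word_ids:
            word_ids[word] = next(next_word_id)
            cur.execute("INSERT INTO words VALUES (?, ?)",
                        (word_ids[word], word))
        cur.execute("INSERT INTO word_file VALUES (?, ?)",
                    (word_ids[word], file_ids[filename]))

scan("a.txt", "spam eggs spam")
scan("b.txt", "eggs ham")

# Stage 2: the scan is complete, so now build the lookup indexes
# needed for regular operation.
cur.execute("CREATE INDEX idx_word ON words (word)")
cur.execute("CREATE INDEX idx_word_id ON words (word_id)")
cur.execute("CREATE INDEX idx_wf_word ON word_file (word_id)")

# Example lookup: which files contain 'eggs'?
cur.execute("""SELECT f.filename FROM files f
               JOIN word_file wf ON wf.file_id = f.file_id
               JOIN words w ON w.word_id = wf.word_id
               WHERE w.word = 'eggs'""")
print(sorted({row[0] for row in cur.fetchall()}))  # ['a.txt', 'b.txt']
```

The same shape carries over to BDB: keep the three mappings as separate databases, insert blindly during the scan, and only build the reverse mappings once the data is loaded.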