2005/6/17, Jim Gallacher <[EMAIL PROTECTED]>:
> Nicolas Lehuen wrote:
> > Hi Jim,
> >
> > You've done some pretty impressive work here. What surprises me is the
> > O(n) behaviour on DBM and FS. This seems to mean that indexes (or
> > indices, if you prefer) are not used.
>
> ext2/ext3 uses a linked list to access files, hence O(n) when adding a file.
Duh... And they call that a filesystem :P. That's where ReiserFS, the
WinNT FS and other modern filesystems shine: they use more efficient
data storage (I think they use B-tree indices).
> > For DBM, well, if BDB could not handle indexes, this would be big
> > news. Are you 100% sure that the Berkeley implementation is used ?
>
> I used dbhash, which according to the python docs is the interface to
> the bsddb module. The code is pretty much the same as in DbmSession.
> Code snippet is at the bottom of this message. Running "/usr/bin/file
> bsd.dbm" gives:
>
> bsd.dbm: Berkeley DB (Hash, version 8, native byte-order)
>
> **** Brain Wave ****
>
> It just occurred to me that the performance problem may be related to
> opening and closing the dbm file for every record insertion. Adjusting
> the test so that the file is only opened once, I get O(1), and a great
> speed boost: 0.2 seconds per 1000 records all the way up to 50,000
> records. At that point I start to see periodic performance hits due to
> disk syncing, but that is to be expected.
>
> I have no idea what to do with this knowledge unless we can figure out a
> way to keep the dbm file open across multiple requests. Ideas??
>
> **** End of Wave ****
Well, we could keep a permanent reference to the opened dbm file at
the module level and make sure that locks are used so that everything
works correctly in a threaded MPM environment?
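To make the idea concrete, here is a hypothetical sketch of that approach: a module-level dbm handle that is opened once and reused across requests, guarded by a lock for threaded MPMs. It uses the stdlib dbm and pickle modules in place of dbhash/cPickle, and the function names are my own stand-ins, not the actual DbmSession code:

```python
import dbm
import pickle
import threading
import time
import os
import tempfile
import uuid

# Module-level handle, shared across requests and guarded by a lock so
# that a threaded MPM cannot interleave reads and writes.
_lock = threading.Lock()
_handle = None
_path = os.path.join(tempfile.mkdtemp(), 'sessions.dbm')

def _get_dbm():
    # Open the dbm file once; every later call reuses the same handle.
    global _handle
    if _handle is None:
        _handle = dbm.open(_path, 'c')
    return _handle

def save_session(sid, data):
    with _lock:
        _get_dbm()[sid] = pickle.dumps(data)

def load_session(sid):
    with _lock:
        return pickle.loads(_get_dbm()[sid])

sid = uuid.uuid4().hex
save_session(sid, {'_accessed': time.time(), 'stuff': 'test'})
print(load_session(sid)['stuff'])  # test
```

In a real Apache setup each process would get its own handle, so cross-process consistency would still depend on the dbm library's own locking.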
> > For FS, I don't know about ext3, but in ReiserFS or the Win NT
> > filesystem, there are indexes that should speed up file lookups, and
> > should certainly not yield a O(n) performance.
>
> Don't forget, I only benchmarked creating new session files. Reading or
> writing existing files may be an entirely different matter. Certainly
> one of the benefits of ReiserFS is that it handles a large number of
> small files efficiently.
>
> > Anyway, implementing
> > FS2 instead of FS is not that difficult, and if it yields predictable
> > results even on ext3, then we should go for it.
>
> Already done - it's just a couple of extra lines. Doing some testing today.
Are you replacing FS with FS2 or adding a new implementation? I think
replacing is better, since I can't see any drawback to the FS2
approach.
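For the record, here is a sketch of what I understand the FS2 layout to be: session files fanned out into subdirectories named after a prefix of the session id, so no single directory grows without bound on ext2/ext3. The fanout width and the helper name are my own assumptions:

```python
import os
import tempfile
import uuid

def fs2_path(root, sid, fanout=2):
    # Place each session file in a subdirectory named after the first
    # `fanout` characters of its id, bounding per-directory size.
    subdir = os.path.join(root, sid[:fanout])
    os.makedirs(subdir, exist_ok=True)
    return os.path.join(subdir, sid)

root = tempfile.mkdtemp()
sid = uuid.uuid4().hex
with open(fs2_path(root, sid), 'w') as f:
    f.write('session data')
```

With a two-character hex prefix this caps any one directory at 256 subdirectories, each holding a small share of the sessions.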
> > As for the MySQL implementation, well, I've been promising it many
> > times, but I can provide a connection pool implementation that could
> > speed up applicative code as well as your session management code.
> > What I would need to do is to make it mod_python friendly, i.e. make
> > it configurable through PythonOption directives. Do you think it would
> > be a good idea to integrate it into mod_python ?
>
> Connection pooling seems like a common request on the mailing list, so
> I'd say yes.
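In case it helps as a starting point, a minimal pool along the lines I have in mind: a bounded queue of connections created up front by a user-supplied factory. This is a hypothetical sketch, not the code I'd ship — sqlite3 stands in for MySQLdb, and the PythonOption wiring is omitted:

```python
import queue
import sqlite3

class ConnectionPool:
    # Bounded, thread-safe pool: connections are created up front by a
    # factory and handed out and returned through a Queue.
    def __init__(self, factory, size=5):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self, timeout=None):
        # Blocks until a connection is free (or raises queue.Empty
        # once the timeout expires).
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(lambda: sqlite3.connect(':memory:'), size=2)
conn = pool.acquire()
conn.execute('CREATE TABLE sessions (sid TEXT, data BLOB)')
pool.release(conn)
```

Note that sqlite3 connections refuse cross-thread use by default (check_same_thread), so a threaded deployment would use a driver like MySQLdb or relax that flag.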
I'll see what I can do. The weather is very promising for this
weekend, so I don't think I'll have much time to stay in and write
code (even given that it's part of my day job ;), but who knows?
> > Regards,
> > Nicolas
> >
>
> Code snippet from my benchmark script.
>
> import dbhash
>
> def create_bsd(test_dir, test_runs, number_of_files, do_sync=False):
>     if not os.path.exists(test_dir):
>         os.mkdir(test_dir)
>     dbmfile = "%s/bsd.dbm" % (test_dir)
>     dbmtype = dbhash
>     i = 0
>     timeout = 3600
>     count = 0
>     total_files = 0
>     results_file = '%s/bsd.%02d.results' % (test_dir, i)
>
>     results = open(results_file, 'w')
>     write_header(results, 'dbhash', test_runs, number_of_files)
>     while count < test_runs:
>         start_time = time.time()
>         i = 0
>         while i < number_of_files:
>             sid = _new_sid()
>             data = {'_timeout': timeout,
>                     '_accessed': time.time(),
>                     'stuff': 'test test test test test'
>                     }
>             # dbm file is opened and closed for each insertion,
>             # which is the same as the current DbmSession
>             # implementation
>             dbm = dbmtype.open(dbmfile, 'c')
>             dbm[sid] = cPickle.dumps(data)
>             dbm.close()
>             i += 1
>             total_files += 1
>
>         count += 1
>         elapsed_time = time.time() - start_time
>         print_progress(count, i, elapsed_time)
>         write_result(results, count, i, elapsed_time, total_files)
>         results.flush()
>         if do_sync:
>             s
>
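For comparison, here is the open-once variant from the "brain wave" above as a self-contained sketch: the dbm file is opened a single time for the whole batch rather than once per insertion. I've substituted the stdlib dbm and pickle modules and a uuid-based stand-in for _new_sid() so it runs on its own:

```python
import dbm
import pickle
import os
import tempfile
import time
import uuid

def insert_open_once(dbmfile, number_of_files):
    # Open once, insert every record, close once -- the change that
    # turned the per-insertion cost from O(n) into O(1) in the benchmark.
    db = dbm.open(dbmfile, 'c')
    try:
        for _ in range(number_of_files):
            sid = uuid.uuid4().hex  # stand-in for _new_sid()
            data = {'_timeout': 3600,
                    '_accessed': time.time(),
                    'stuff': 'test test test test test'}
            db[sid] = pickle.dumps(data)
    finally:
        db.close()

path = os.path.join(tempfile.mkdtemp(), 'bsd.dbm')
insert_open_once(path, 1000)
db = dbm.open(path, 'r')
print(len(db.keys()))  # 1000
db.close()
```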