2005/6/17, Jim Gallacher <[EMAIL PROTECTED]>:
> Nicolas Lehuen wrote:
> > Hi Jim,
> >
> > You've done some pretty impressive work here. What surprises me is the
> > O(n) behaviour on DBM and FS. This seems to mean that indexes (or
> > indices, if you prefer) are not used.
> 
> ext2/ext3 store directory entries in a linked list, hence the O(n) behaviour when adding a file.

Duh... And they call that a filesystem :P. That's where ReiserFS, the
WinNT filesystem and other modern filesystems shine: they use more
efficient directory storage (I think they use B-tree indices).

> > For DBM, well, if BDB could not handle indexes, this would be big
> > news. Are you 100% sure that the Berkeley implementation is used?
> 
> I used dbhash, which according to the python docs is the interface to
> the bsddb module. The code is pretty much the same as in DbmSession.
> Code snippet is at the bottom of this message. Running "/usr/bin/file
> bsd.dbm" gives:
> 
> bsd.dbm: Berkeley DB (Hash, version 8, native byte-order)
> 
> **** Brain Wave ****
> 
> It just occurred to me that the performance problem may be related to
> opening and closing the dbm file for every record insertion. Adjusting
> the test so that the file is only opened once, I get O(1) and a great
> speed boost: 0.2 seconds per 1000 records all the way up to 50,000
> records. At that point I start to see periodic performance hits due to
> disk syncing, but that is to be expected.
> 
> I have no idea what to do with this knowledge unless we can figure out a
> way to keep the dbm file open across multiple requests. Ideas??
>
> **** End of Wave ****

Well, we could keep a permanent reference to the opened dbm file at
the module level, and make sure that locks are used so that everything
works correctly in a threaded MPM environment?
 
> > For FS, I don't know about ext3, but in ReiserFS or the Win NT
> > filesystem, there are indexes that should speed up file lookups, and
> > should certainly not yield a O(n) performance.
> 
> Don't forget, I only benchmarked creating new session files. Reading or
> writing existing files may be an entirely different matter. Certainly
> one of the benefits of ReiserFS is that it can handle a large number
> of small files in an efficient manner.
>
> > Anyway, implementing
> > FS2 instead of FS is not that difficult, and if it yields predictable
> > results even on ext3, then we should go for it.
> 
> Already done - it's just a couple of extra lines. Doing some testing today.

Are you replacing FS with FS2 or adding a new implementation? I think
replacing is better, since I can't see any drawback to the FS2
approach.
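For reference, the FS2 idea as I understand it can be sketched like this: shard session files into subdirectories keyed by a prefix of the session id, so no single directory has to hold every session (the function and parameter names below are made up for illustration, not mod_python's actual API):

```python
import os

def fs2_session_path(root, sid, depth=1, width=2):
    # Build e.g. root/ab/abcdef... for sid "abcdef...": each level of
    # depth consumes `width` characters of the session id, keeping any
    # one directory small even with many thousands of sessions.
    parts = [sid[i * width:(i + 1) * width] for i in range(depth)]
    subdir = os.path.join(root, *parts)
    os.makedirs(subdir, exist_ok=True)
    return os.path.join(subdir, sid)
```

On ext2/ext3, where directory lookup is linear, this turns one O(n) directory into many small ones.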

> > As for the MySQL implementation, well, I've been promising it many
> > times, but I can provide a connection pool implementation that could
> > speed up application code as well as your session management code.
> > What I would need to do is to make it mod_python friendly, i.e. make
> > it configurable through PythonOption directives. Do you think it would
> > be a good idea to integrate it into mod_python ?
> 
> Connection pooling seems like a common request on the mailing list, so
> I'd say yes.
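A minimal pool could look something like this (a generic sketch only, not Nicolas's implementation — the class name and the `factory` argument, e.g. a lambda wrapping MySQLdb.connect, are made up for illustration):

```python
import queue

class ConnectionPool:
    """Hand out connections from a fixed-size queue and take them
    back for reuse, instead of opening one per request."""

    def __init__(self, factory, size=5):
        # factory is any zero-argument callable that returns a new
        # connection; all connections are created up front.
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())

    def get(self, timeout=None):
        # Blocks until a connection is free, so a burst of requests
        # can never hold more connections than the pool size.
        return self._pool.get(timeout=timeout)

    def put(self, conn):
        # Return the connection for the next request to reuse.
        self._pool.put(conn)
```

In mod_python the pool would live at module level, much like the dbm handle idea above, and the size could come from a PythonOption directive.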

I'll see what I can do. The weather is very promising for this
weekend, so I don't think I'll have much time to stay in and write
some code (all that given the fact that it's part of my day job ;),
but who knows?

> > Regards,
> > Nicolas
> >
> 
> Code snippet from my benchmark script.
> 
> import os
> import time
> import cPickle
> import dbhash
> 
> def create_bsd(test_dir, test_runs, number_of_files, do_sync=False):
>      if not os.path.exists(test_dir):
>          os.mkdir(test_dir)
>      dbmfile = "%s/bsd.dbm" % (test_dir)
>      dbmtype = dbhash
>      i = 0
>      timeout = 3600
>      count = 0
>      total_files = 0
>      results_file = '%s/bsd.%02d.results' % (test_dir,i)
> 
>      results = open(results_file,'w')
>      write_header(results, 'dbhash', test_runs, number_of_files)
>      while count < test_runs:
>          start_time = time.time()
>          i = 0
>          while i < number_of_files:
>              sid = _new_sid()
>              data = {'_timeout': timeout,
>                      '_accessed': time.time(),
>                      'stuff': 'test test test test test'
>                      }
>              # dbm file is opened and closed for each insertion
>              # which is the same as the current DbmSession
>              # implementation
>              dbm = dbmtype.open(dbmfile, 'c')
>              dbm[sid] = cPickle.dumps(data)
>              dbm.close()
>              i += 1
>              total_files += 1
> 
>          count += 1
>          elapsed_time = time.time() - start_time
>          print_progress(count, i, elapsed_time)
>          write_result(results,count, i, elapsed_time, total_files)
>          results.flush()
>          if do_sync:
>              s
>
