I think the key decision is what serialization format we're going to use for btree nodes, log entries, etc., and how that relates to caching data during transactions, maintaining free lists, etc.

The current serializer produces a byte sequence. If we continue with that model, how do we write/read this stuff from disk? How and where do we store it prior to committing a transaction?

When we create a new key or value as a binary stream within a transaction, how is it stored in memory? If we want a multi-process but non-socket-based approach, we need to figure out how to store data in shared memory, etc.

For example, in BDB, the primitive is the page. BTree nodes are laid out in one or more pages, and each page has some binary metadata explaining its type and organization (free list, etc.). A short key or value is written directly into the page; a long one is written into an overflow page, and so on. There are lots of details to deal with in managing variable-sized data on disk. Pages that are dirty are kept in memory (which is why BDB can run out of transaction space; the pages overflow the max cache size when you are writing lots of data).
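To make the flavor of that concrete (this is just an illustration, not BDB's actual layout), a page header in cffi might look something like:

(cffi:defcstruct page-header
  (page-type   :uint8)   ;; leaf, internal, overflow, free, ...
  (level       :uint8)   ;; depth of this page in the btree
  (num-entries :uint16)  ;; entries currently stored on the page
  (free-offset :uint16)  ;; where free space begins within the page
  (next-free   :uint32)) ;; next page on the free list, if any

;; reading a field from a raw page pointer:
;; (cffi:foreign-slot-value page '(:struct page-header) 'page-type)

Short keys would then live between the header and free-offset; anything too big gets a pointer to an overflow page instead.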


However, to get started, the easiest thing is to reuse the existing memutils serializer, not worry about multi-process operation, and not worry about fragmentation, sharing space, and maintaining free lists (except perhaps for btree nodes).

Something like:
- btree nodes only keep pointers to variable-sized keys stored elsewhere in the file; new keys and values of the same or smaller size are written in place, otherwise new space is allocated at the end of the file.
- btree nodes are a fixed-size page on disk and keep some free-list information so we can reuse them.
- transactions simply keep track of the primitive operations on the database and the associated data in a memory queue, and write those ops to disk as part of the txn commit process. The pages and key/value pairs that will be touched in that operation are also stored in that txn log.
- when a transaction commits, it replays the log to write everything to disk appropriately. The list of touched data is then passed up the commit chain to invalidate any pending transactions that have a conflict. Everything is speculative in this case, but we don't have to deal with locking. (A sketch of what the log records might look like follows.)
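Here's a rough sketch of those log records — all the names (txn-op, log-op, commit-txn) are made up for illustration, and the disk-writing helper is assumed:

(defstruct txn-op
  kind      ;; :read or :write
  target    ;; a (btree . page-offset) pair or a value offset
  payload)  ;; serialized bytes, for writes

(defstruct txn
  (log (make-array 0 :adjustable t :fill-pointer t))
  (touched nil))  ;; pages/offsets this txn read or wrote

(defun log-op (txn kind target &optional payload)
  "Queue a primitive op in memory; nothing touches disk yet."
  (vector-push-extend
   (make-txn-op :kind kind :target target :payload payload)
   (txn-log txn))
  (push target (txn-touched txn)))

(defun commit-txn (txn stream)
  "Replay the log to disk, then hand the touched list up the commit
chain so conflicting pending txns can be invalidated."
  (loop for op across (txn-log txn)
        when (eq (txn-op-kind op) :write)
          do (write-op-to-disk op stream)) ;; assumed helper
  (txn-touched txn))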

This strikes a nice balance between a lisp-sexp serialization format that performs poorly and a highly optimized, low-level implementation that is blindingly fast.

A big decision is:
- use cffi/uffi and do much of the serialization & btree implementation in C/static memory, or do all of it in pools of preallocated lisp arrays and write a new serializer to operate on lisp data.

I lean towards using cffi to manipulate static data, just because it's going to be easier to get performance that way, and it's also going to be much easier to do a multi-process implementation (by operating on static memory with primitive locks in a shared memory region).
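For instance (made-up names, and an assumed 8KB page size), a page pool in static memory is just:

(defconstant +page-size+ 8192) ;; assumed page size

(defun allocate-page-pool (npages)
  "Allocate a contiguous pool of pages in foreign (static) memory.
For multi-process operation this would instead live in a shared
memory region, with primitive locks alongside it."
  (cffi:foreign-alloc :uint8 :count (* npages +page-size+)))

(defun page-pointer (pool page-number)
  "Return a pointer to the start of PAGE-NUMBER within the pool."
  (cffi:inc-pointer pool (* page-number +page-size+)))

;; e.g. write a byte into page 3 without any consing:
;; (setf (cffi:mem-aref (page-pointer pool 3) :uint8 0) 42)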

Predicated on that decision, getting started on the simplest possible btree/dup-btree implementation is the next, most valuable, and most educational step.

The key pieces for a single-process lisp backend:
- btrees and dup-btrees (indices can be built from these two easily enough). The binary pages could be stored in static data, and the primitive btree ops could directly manipulate data within the page. We pass a C function that directly sorts binary sequences rather than having to deserialize to sort; we'd need to write that in lisp to operate on static data or on lisp arrays, since deserializing on each key comparison is too expensive. (See the comparator sketch after this list.)
- a set of transaction records (lisp structs and conses) that simply keeps tuples of (op {rd | wr} {btree+page-offset | value-offset} [values]) in a memory queue. Could use static memory for this to reduce load on the GC.
- a blocking primitive library that serializes txn commits (i.e. write log to disk, write data to disk, write 'commit done' to log, invalidate pending/conflicting txns).
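As a strawman, the in-place key comparison could look like this — assuming (purely for illustration) that keys are stored as a 2-byte length prefix followed by the raw bytes:

(defun compare-static-keys (ptr-a ptr-b)
  "Lexicographically compare two length-prefixed keys in foreign
memory without deserializing them. Returns -1, 0, or 1."
  (let ((len-a (cffi:mem-ref ptr-a :uint16))
        (len-b (cffi:mem-ref ptr-b :uint16)))
    (loop for i from 0 below (min len-a len-b)
          for a = (cffi:mem-aref ptr-a :uint8 (+ 2 i))
          for b = (cffi:mem-aref ptr-b :uint8 (+ 2 i))
          unless (= a b)
            do (return-from compare-static-keys (if (< a b) -1 1)))
    (cond ((< len-a len-b) -1)
          ((> len-a len-b) 1)
          (t 0))))

The same logic written against lisp arrays would let us keep a single comparator for either storage choice.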

A nice auxiliary hack would be:
- rewrite memutils to use uffi/cffi entirely to manipulate static data rather than calling out to C to do it. This maintains efficiency but removes the C compilation step from the build, except for users of BDB.
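For instance, most of what a C helper does there reduces to plain cffi — this is a hypothetical buffer accessor, not the actual memutils API:

(defun buffer-write-int32 (value ptr offset)
  "Write a 32-bit integer into a foreign buffer at byte OFFSET,
entirely from lisp -- no C helper, no compile step."
  (setf (cffi:mem-ref ptr :int32 offset) value))

(defun buffer-read-int32 (ptr offset)
  "Read it back."
  (cffi:mem-ref ptr :int32 offset))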


So what do people think about the cffi->static-data vs. lisp->array-pool decision?


Ian

On May 13, 2008, at 2:03 PM, Leslie P. Polzer wrote:


I suppose the "binary paging" approach mentioned in the design considerations
document refers to the problem of organizing the data efficiently on disk
right from the start. Is this correct?

Do you think it would make good sense to start working on the btree library without thinking much about on-disk efficiency, leaving this part for later?

I'm not sure a btree where on-disk storage organization is separated from the
rest like that can achieve enough efficiency...

 Thanks,

   Leslie

_______________________________________________
elephant-devel site list
elephant-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/elephant-devel
