On May 14, 2008, at 3:28 AM, Leslie P. Polzer wrote:
For example, in BDB, the primitive is the page. BTree nodes are laid
out in one or more pages; each page has some binary metadata
explaining its type and organization (free list, etc.). A short key
value is written directly into the page, a long one is written into
an overflow page, etc.
InnoDB also uses this approach.
There is a massive body of work and many variations on ways to lay out
indexing structures and data on disk. The tradeoffs depend on your
specific needs. My recommendation is we don't get ambitious and stick
with a simple BTree or perhaps B+Tree for now. If we abstract
properly, we should be able to replace the underlying page storage
mechanism later.
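Roughly, the kind of page abstraction I mean (a sketch only;
+page-size+ and the page struct are invented names, not code in the
tree):

(defconstant +page-size+ 4096
  "Fixed size of an on-disk page, in octets.")

(defstruct (page (:constructor make-page (id type)))
  "A fixed-size binary page plus a little metadata, as in BDB."
  (id 0 :type fixnum)              ; page number within the database file
  (type :btree-node :type keyword) ; e.g. :btree-node, :overflow, :free
  (bytes (make-array +page-size+ :element-type '(unsigned-byte 8)
                     :initial-element 0)
         :type (simple-array (unsigned-byte 8) (*))))

Swapping the storage mechanism later then just means changing how the
bytes slot is allocated and flushed.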
- new keys and values of the same or lesser length are overwritten in
place; otherwise new space is allocated at the end of the file (a
rough sketch follows below).
Okay, but maybe let the user reserve extent space.
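Purely as illustration of that rule (write-field is an invented name,
assuming a binary file stream opened with :direction :io):

(defun write-field (stream bytes old-offset old-length)
  "Overwrite the old extent in place when BYTES fits, otherwise
append the field at the end of the file.  Returns the offset used."
  (let ((offset (if (<= (length bytes) old-length)
                    old-offset
                    (file-length stream))))
    (file-position stream offset)
    (write-sequence bytes stream)
    offset))

Reserving extent space as you suggest would just mean rounding the
initial allocation up so later writes keep fitting in place.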
- transactions simply keep track of the primitive operations on the
database and the associated data in a memory queue and write those
ops to disk as part of the txn commit process. The pages and
key/value pairs that will be touched in that operation are also
stored in that txn log.
- when a transaction commits, it replays the log to write everything
to disk appropriately. The list of touched data is then passed up the
commit chain to invalidate any pending transactions that have a
conflict. Everything is speculative in this case, but we don't have
to deal with locking.
I like this approach.
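To make that concrete, the scheme looks something like this (all
names invented; a real log would record page ids and byte extents):

(defstruct txn
  (log '() :type list)       ; queued primitive ops, newest first
  (touched '() :type list))  ; keys/pages read or written by this txn

(defun txn-record (txn op key value)
  "Queue a primitive operation instead of applying it immediately."
  (push (list op key value) (txn-log txn))
  (pushnew key (txn-touched txn) :test #'equalp))

(defun txn-commit (txn apply-fn)
  "Replay the log in order through APPLY-FN, then return the touched
set so pending transactions with conflicts can be invalidated."
  (dolist (entry (reverse (txn-log txn)))
    (apply apply-fn entry))
  (txn-touched txn))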
A big decision is:
- Use cffi/uffi and do much of the serialization & btree
implementation in C/static memory
IME this FFI stuff can be quite hard to debug.
That's true, but the CFFI operations would basically be the primitive
set that is already used in memutils to implement buffer streams.
I lean towards using cffi to manipulate static data, just because
it's going to be easier to get performance via that method, and it's
also going to be much easier to do a multi-process implementation (by
operating on static memory and primitive locks in a shared memory
region).
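For flavor, the cffi equivalent of what memutils does through uffi,
assuming cffi is loaded:

(let ((buf (cffi:foreign-alloc :unsigned-char :count 16)))
  (unwind-protect
       (progn
         ;; poke a byte directly into the C heap and read it back
         (setf (cffi:mem-aref buf :unsigned-char 0) 42)
         (cffi:mem-aref buf :unsigned-char 0))  ; => 42
    (cffi:foreign-free buf)))

The same mem-aref style primitives would work on a pointer into a
mapped shared-memory region, which is what makes the multi-process
story easier.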
I cannot estimate the performance trade-offs involved here,
but in general I'm in favor of a Lisp-based approach...
I think if we abstract correctly with a more direct approach in mind,
we could always go back and do it the other way later. Probably best
to start with lisp; it just may mean more work in the meantime to
rewrite the functionality in memutils and serializer2.
- the binary pages could be stored in static data
I don't think I understand this. What does "static" mean here
(and above)?
Static data just means data allocated from the C heap rather than by
the lisp garbage collector. Lisp data structures and the primitive
btree ops could directly manipulate data within the page. In BDB we
pass a C function that directly sorts binary sequences rather than
having to deserialize to sort; we'd need to write that in lisp to
operate on static data or on lisp arrays. Deserializing on each key
comparison is too expensive.
Yes.
A nice auxiliary hack would be:
- rewrite memutils to entirely use uffi/cffi to manipulate static
data rather than calling out to C to do it. Maintains efficiency but
removes the compilation build step except for users of BDB.
I have looked at memutils/the serializer, but it's very hard for me
to replace it with something else, because I'm not sure what would be
required of a replacement.
The partial conclusion to which I came was that memutils just models
a bivalent stream so the backend can communicate with the serializer.
That's essentially correct, although memutils was written to implement
buffer streams which are then used by the serializer - in effect it's
the serializer that uses memutils to send serialized buffers to the
data stores. It was originally designed for BDB and has the benefit
of avoiding an extra copy step when talking to the BDB API. If you
are serializing structures into a lisp array, to pass them to C you
need to copy them to a C array, and BDB then copies that array into
the appropriate cached page and ultimately writes that page to disk.
Using memutils as-is is actually more expensive for a lisp-only
solution if we're storing pages in lisp arrays (which hopefully will
quickly become tenured and stop being copied around by the
collector). We would then write into a C array with memutils, copy
into a lisp array and write that to disk rather than writing directly
to the lisp array.
The data format is hard to figure out for me, however, because of
all the UFFI stuff involved...
The serializer is probably fine as it is from a functional standpoint,
but to do it in lisp we'll need to change out all the primitives it
uses (buffer-write-int32, buffer-write-byte, etc).
Buffer streams are wrappers around C arrays:

(defstruct buffer-stream
  "A stream-like interface to foreign (alien) char buffers."
  (buffer (allocate-foreign-object :unsigned-char 10)
          :type array-or-pointer-char)
  (size 0 :type fixnum)     ; bytes of valid data (the write pointer)
  (position 0 :type fixnum) ; the current read pointer
  (length 10 :type fixnum)) ; total size of the allocated region
We allocate that object directly from the C heap (just like malloc)
using the allocate-foreign-object call. This is currently done
through uffi; using the uffi compat layer from cffi leads to errors.
Size is the current size of valid data in the array (a write
pointer). The position is the current read pointer and the length is
the total size of the allocated region.
If I write two int32's into the stream the values above are:
size 8, position 0, length 10
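For example, a pure-lisp stand-in for buffer-write-int32 over an
octet array could look like this (illustrative only; endianness and
the struct bookkeeping have to match the existing serializer):

(defun lisp-buffer-write-int32 (value buffer size)
  "Write VALUE as 4 little-endian octets at offset SIZE in BUFFER;
return the advanced write pointer."
  (dotimes (i 4)
    (setf (aref buffer (+ size i)) (ldb (byte 8 (* 8 i)) value)))
  (+ size 4))

Writing two int32's with this advances size from 0 to 8, just as in
the numbers above.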
If the BTree node binary data is stored in a lisp octet array, and we
don't want to deserialize to compare keys, then we need to write the
procedure that you can find in libberkeley-db.c in the src/db-bdb
directory. It performs a content-sensitive comparison on a
byte-by-byte basis. This has the huge benefit of allowing us to do
key comparisons 'in-place' without creating lisp objects. We pass
this C function to BDB, which uses it to compare keys and values
directly in the binary file pages.
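A simplified lisp stand-in for that comparator (raw lexicographic
order only; the real C version is content-sensitive, i.e. it
understands the serialized type tags):

(defun compare-octets (a a-start a-end b b-start b-end)
  "memcmp-style comparison: negative, zero, or positive result."
  (loop for i from a-start below a-end
        for j from b-start below b-end
        for diff = (- (aref a i) (aref b j))
        unless (zerop diff) return diff
        ;; equal over the common prefix: the shorter field sorts first
        finally (return (- (- a-end a-start) (- b-end b-start)))))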
I tell you what, if we wrap an abstraction around a lisp paged btree
approach, we can cheat with memutils for the time being and replace it
later. Every field written to a binary page should have an extent
associated with it. To compare keys, we copy that extent to a buffer
stream, run the comparison operation in the existing memutils, etc.
To serialize/deserialize, we just copy to/from the lisp binary page
and the buffer stream abstraction.
To look up data in a page you would do something like:
serialize the key to a buffer stream (bs1)
database file => read a fixed-size page into an array
page => copy the serialized key to a buffer stream (bs2)
compare bs1 and bs2
look at the btree page to decide which key to compare next, or
copy the value to a buffer stream (bs3) and run the deserializer.
Later this should be something like:
serialize the key to a lisp array
database file => read a fixed-size page into an array
byte-by-byte comparison of the key in the lisp array to a field at
offset + length in the binary page
on success, deserialize from the offset + length field in the binary
page.
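With a comparator like compare-octets above, that comparison step
reduces to something like (again a sketch):

(defun key-matches-field-p (key page offset len)
  "Compare the serialized KEY octets against the field of length LEN
at OFFSET in the binary PAGE, with no deserialization."
  (zerop (compare-octets key 0 (length key)
                         page offset (+ offset len))))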
I'm sorry if this is confusing. I have some code dealing with fixed
pools of binary pages and some attempts at a btree and log
implementation in src/contrib/db-lisp. Rucksack is also an excellent
source of ideas. The Rucksack persistence model is too different from
Elephant's for me to be comfortable adapting it, but it has a much
more elegant serializer and an all-in-lisp implementation of btrees
(as I recall). That is another great place to start getting ideas.
Ian
Leslie
_______________________________________________
elephant-devel site list
elephant-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/elephant-devel