On May 14, 2008, at 3:28 AM, Leslie P. Polzer wrote:


For example, in BDB, the primitive is the page. BTree nodes are laid out in one or more pages, and each page has some binary metadata explaining its type and organization (free list, etc.). A short key value is written directly into the page, a long one is written into an overflow page, and so on.

InnoDB also uses this approach.

There is a massive body of work and many variations on ways to lay out indexing structures and data on disk. The tradeoffs depend on your specific needs. My recommendation is that we don't get ambitious and stick with a simple BTree, or perhaps a B+Tree, for now. If we abstract properly, we should be able to replace the underlying page storage mechanism later.
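
To make the page primitive concrete, here is a minimal sketch of the kind of page record a simple lisp-side pager might keep (the names, slots, and 4K size are all hypothetical, not taken from BDB or InnoDB):

(defconstant +page-size+ 4096)   ; assumed fixed page size

;; Sketch only: one record per fixed-size region of the database file.
(defstruct page
  (id 0 :type fixnum)                 ; page number within the file
  (type :btree-leaf :type keyword)    ; :btree-internal, :btree-leaf, :overflow, :free
  (dirty-p nil :type boolean)         ; needs flushing at commit?
  (bytes (make-array +page-size+ :element-type '(unsigned-byte 8))
         :type (simple-array (unsigned-byte 8) (*))))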


- new keys and values of the same or lesser length are written in place; otherwise new space is allocated at the end of the file.

Okay, but maybe let the user reserve extent space.
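
Reserving extent space fits naturally into the in-place rule. A hedged sketch of the write path (every name here is hypothetical, including the reserved-slack slot for the reserved-extent suggestion above):

;; Sketch only: overwrite in place when the new value fits in the old
;; extent plus any user-reserved slack; otherwise append at end of file.
(defun write-value (file field new-bytes)
  (if (<= (length new-bytes)
          (+ (field-length field) (field-reserved field)))
      (write-bytes-at file (field-offset field) new-bytes)
      (setf (field-offset field)
            (append-at-eof file new-bytes))))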



- transactions simply keep track of the primitive operations on the database and the associated data in a memory queue, and write those ops to disk as part of the txn commit process. The pages and key/value pairs that will be touched in that operation are also stored in that txn log.
- when a transaction commits, it replays the log to write everything to disk appropriately. The list of touched data is then passed up the commit chain to invalidate any pending transactions that have a conflict. Everything is speculative in this case, but we don't have to deal with locking.

I like this approach.
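
To pin down the shape of that log, a rough sketch (apply-op and invalidate-conflicting-txns are hypothetical placeholders):

;; Sketch only: a txn queues primitive ops plus the pages/keys it
;; touches; commit replays the queue, then passes the touched set up
;; the commit chain so conflicting pending txns can be invalidated.
(defstruct txn
  (ops (make-array 0 :adjustable t :fill-pointer t))  ; queued primitive ops
  (touched nil :type list))                           ; pages & key/value pairs touched

(defun commit-txn (txn db)
  (loop for op across (txn-ops txn)      ; replay the log to disk
        do (apply-op db op))
  (invalidate-conflicting-txns db (txn-touched txn)))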


A big decision is:
- Use cffi/uffi and do much of the serialization & btree implementation in C/static memory

IME this FFI stuff can be quite hard to debug.

That's true, but the CFFI operations would basically be the primitive set already used in memutils to implement buffer streams.


I lean towards using cffi to manipulate static data, just because it's going to be easier to get performance via that method, and it's also going to be much easier to do a multi-process implementation (by operating on static memory and primitive locks in a shared-memory region).

I cannot estimate the performance trade-offs involved here,
but in general I'm in favor of a Lisp-based approach...

I think if we abstract correctly with a more direct approach in mind, we could always go back and do it the other way later. It's probably best to start with lisp; it just may mean more work in the meantime to rewrite the functionality in memutils and serializer2.


  - the binary pages could be stored in static data

I don't think I understand this. What does "static" mean here
(and above)?

Static data just means data allocated from the C heap, not managed by the lisp garbage collector. Lisp data structures, by contrast, can be moved around by the collector.


and the primitive btree ops could directly manipulate data within the page? We pass a C function that directly sorts binary sequences rather than having to deserialize to sort. We'd need to write that in lisp to operate on static data or on lisp arrays. Deserializing on each key comparison is too expensive.

Yes.
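
As a concrete picture of manipulating static data from lisp, here is what the page bytes might look like through cffi (nothing here is existing memutils code):

;; Sketch only: the page lives on the C heap, invisible to the GC, and
;; we read/write octets through cffi instead of calling out to C.
(defparameter *page* (cffi:foreign-alloc :unsigned-char :count 4096))

(defun page-byte (page offset)
  (cffi:mem-aref page :unsigned-char offset))

(defun (setf page-byte) (value page offset)
  (setf (cffi:mem-aref page :unsigned-char offset) value))

;; e.g. stamp a one-octet page-type tag at offset 0:
;; (setf (page-byte *page* 0) 1)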


A nice auxiliary hack would be:
- rewrite memutils to entirely use uffi/cffi to manipulate static data rather than calling out to C to do it. Maintains efficiency but removes the compilation build step except for users of BDB.

I have looked at memutils/the serializer, but it's very hard for me to replace it with something else, because I'm not sure what would be required of a replacement.

The partial conclusion to which I came was that memutils just models
a bivalent stream so the backend can communicate with the serializer.

That's essentially correct, although memutils was written to implement buffer streams, which are then used by the serializer - in effect it's the serializer that uses memutils to send serialized buffers to the data stores. It was originally designed for BDB and has the benefit of avoiding an extra copy step when talking to the BDB API. If you are serializing structures into a lisp array, then to pass them to C you need to copy them into a C array; BDB then copies that array into the appropriate cached page and ultimately writes that page to disk.

Using memutils as-is is actually more expensive for a lisp-only solution if we're storing pages in lisp arrays (which hopefully will quickly become tenured and stop being copied around by the collector). We would then write into a C array with memutils, copy into a lisp array and write that to disk rather than writing directly to the lisp array.

The data format is hard to figure out for me, however, because of
all the UFFI stuff involved...

The serializer is probably fine as it is from a functional standpoint, but to do it in lisp we'll need to swap out all the primitives it uses (buffer-write-int32, buffer-write-byte, etc.).
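
For instance, a lisp-array-backed version of buffer-write-int32 might look roughly like this (the real memutils primitive targets a foreign buffer; the byte order here is an arbitrary little-endian assumption):

;; Sketch only, not the memutils implementation.
(defun lisp-buffer-write-int32 (int buffer position)
  "Write INT as 4 little-endian octets into BUFFER at POSITION."
  (declare (type (unsigned-byte 32) int)
           (type (simple-array (unsigned-byte 8) (*)) buffer))
  (dotimes (i 4 (+ position 4))
    (setf (aref buffer (+ position i))
          (ldb (byte 8 (* 8 i)) int))))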

Buffer streams are wrappers around C arrays:

(defstruct buffer-stream
  "A stream-like interface to foreign (alien) char buffers."
  (buffer (allocate-foreign-object :unsigned-char 10)
          :type array-or-pointer-char)  ; the C-heap octet array
  (size 0 :type fixnum)                 ; write pointer: octets of valid data
  (position 0 :type fixnum)             ; read pointer
  (length 10 :type fixnum))             ; total octets allocated

We allocate that object directly from the C heap (just like malloc) using the allocate-foreign-object call. This is currently done through uffi; using the uffi compat layer from cffi leads to errors.

Size is the current size of valid data in the array (a write pointer). The position is the current read pointer and the length is the total size of the allocated region.

If I write two int32's into the stream, the values above are:
size 8, position 0, length 10
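
In code, that example looks like this (I'm assuming the value-then-stream argument order for buffer-write-int32; check memutils for the real signature):

(let ((bs (make-buffer-stream)))
  (buffer-write-int32 17 bs)
  (buffer-write-int32 42 bs)
  (values (buffer-stream-size bs)       ; => 8, two 4-octet ints written
          (buffer-stream-position bs)   ; => 0, nothing read yet
          (buffer-stream-length bs)))   ; => 10, as allocated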


If the BTree node binary data is stored in a lisp octet array, and we don't want to deserialize to compare keys, then we need to write the procedure that you can find in libberkeley-db.c in src/db-bdb. It performs a content-sensitive comparison on a byte-by-byte basis. This has the huge benefit of allowing us to do key comparisons 'in-place' without creating lisp objects. We pass this C function to BDB, which uses it to compare keys and values directly in the binary file pages.
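
A simplified lisp analogue of that comparator, for flavor (plain lexicographic only; the real routine in libberkeley-db.c is content-sensitive, dispatching on the serialized type tags):

;; Sketch only: compare two serialized fields in place, octet by octet,
;; without building lisp objects for the keys.
(defun compare-octets (a start-a len-a b start-b len-b)
  (loop for i from 0 below (min len-a len-b)
        for byte-a = (aref a (+ start-a i))
        for byte-b = (aref b (+ start-b i))
        when (< byte-a byte-b) do (return :lt)
        when (> byte-a byte-b) do (return :gt)
        finally (return (cond ((< len-a len-b) :lt)
                              ((> len-a len-b) :gt)
                              (t :eq)))))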

I tell you what: if we wrap an abstraction around a lisp paged-btree approach, we can cheat with memutils for the time being and replace it later. Every field written to a binary page should have an extent (an offset and length) associated with it. To compare keys, we copy that extent to a buffer stream and run the comparison operation in the existing memutils. To serialize/deserialize, we just copy to/from the lisp binary page and the buffer stream abstraction.

To look up data in a page you would do something like:

serialize the key to a buffer stream (bs1)
database file => read a fixed-size page into an array
page => copy the serialized key to a buffer stream (bs2)
compare bs1 and bs2
look at the btree page to decide what key to compare next, or
copy the value to a buffer stream (bs3) and run the deserializer.
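
The same flow as lisp pseudocode (every helper name below is hypothetical):

;; Sketch only: walk one page, comparing buffer streams.
(defun lookup-in-page (db page-number key)
  (let ((bs1 (serialize-to-buffer-stream key))   ; key => bs1
        (page (read-page db page-number)))       ; fixed-size page => array
    (multiple-value-bind (field child)
        (find-key page bs1)                      ; copies each candidate key
                                                 ; into bs2, compares to bs1
      (cond (field                               ; hit: value => bs3 => object
             (deserialize-from-buffer-stream
              (copy-field-to-buffer-stream page field)))
            (child                               ; miss: descend to child page
             (lookup-in-page db child key))
            (t nil)))))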

Later this should be something like:

serialize the key to a lisp array
database file => read a fixed-size page into an array
byte-by-byte comparison of the key in the lisp array to a field at offset + length in the binary page
on success, deserialize from the offset + length field in the binary page.

I'm sorry if this is confusing. I have some code dealing with fixed pools of binary pages, and some attempts at a btree and log implementation, in src/contrib/db-lisp. Rucksack is also an excellent source of ideas. The Rucksack persistence model is too different from Elephant's for me to be comfortable adapting it, but it has a much more elegant serializer and an all-in-lisp implementation of btrees (as I recall). That is another great place to start getting ideas.

Ian



_______________________________________________
elephant-devel site list
elephant-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/elephant-devel

