On May 14, 2008, at 3:28 AM, Leslie P. Polzer wrote:
For example, in BDB, the primitive is the page. BTree nodes are laid
out in one or more pages; each page has some binary metadata
explaining its type and organization (free list, etc.). A short key
value is written directly into the page, a long one is written into
an overflow page, etc.
InnoDB also uses this approach.
There is a massive body of work and many variations on ways to lay out
indexing structures and data on disk. The tradeoffs depend on your
specific needs. My recommendation is we don't get ambitious and stick
with a simple BTree or perhaps B+Tree for now. If we abstract
properly, we should be able to replace the underlying page storage
mechanism later.
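Roughly, the kind of page abstraction I mean (a sketch only;
+page-size+ and the page struct are invented names, not code in the
tree):

(defconstant +page-size+ 4096
  "Fixed size of an on-disk page, in octets.")

(defstruct (page (:constructor make-page (id type)))
  "A fixed-size binary page plus a little metadata, as in BDB."
  (id 0 :type fixnum)              ; page number within the database file
  (type :btree-node :type keyword) ; e.g. :btree-node, :overflow, :free
  (bytes (make-array +page-size+ :element-type '(unsigned-byte 8)
                     :initial-element 0)
         :type (simple-array (unsigned-byte 8) (*))))

Swapping the storage mechanism later then just means changing how the
bytes slot is allocated and flushed.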
- new keys and values of the same or lesser length are overwritten in
place; otherwise new space is allocated at the end of the file (a
rough sketch follows below).
Okay, but maybe let the user reserve extent space.
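Purely as illustration of that rule (write-field is an invented name,
assuming a binary file stream opened with :direction :io):

(defun write-field (stream bytes old-offset old-length)
  "Overwrite the old extent in place when BYTES fits, otherwise
append the field at the end of the file.  Returns the offset used."
  (let ((offset (if (<= (length bytes) old-length)
                    old-offset
                    (file-length stream))))
    (file-position stream offset)
    (write-sequence bytes stream)
    offset))

Reserving extent space as you suggest would just mean rounding the
initial allocation up so later writes keep fitting in place.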
- transactions simply keep track of the primitive operations on the
database and the associated data in a memory queue and write those
ops to disk as part of the txn commit process. The pages and
key/value pairs that will be touched in that operation are also
stored in that txn log.
- when a transaction commits, it replays the log to write everything
to disk appropriately. The list of touched data is then passed up the
commit chain to invalidate any pending transactions that have a
conflict. Everything is speculative in this case, but we don't have
to deal with locking.
I like this approach.
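To make that concrete, the scheme looks something like this (all
names invented; a real log would record page ids and byte extents):

(defstruct txn
  (log '() :type list)       ; queued primitive ops, newest first
  (touched '() :type list))  ; keys/pages read or written by this txn

(defun txn-record (txn op key value)
  "Queue a primitive operation instead of applying it immediately."
  (push (list op key value) (txn-log txn))
  (pushnew key (txn-touched txn) :test #'equalp))

(defun txn-commit (txn apply-fn)
  "Replay the log in order through APPLY-FN, then return the touched
set so pending transactions with conflicts can be invalidated."
  (dolist (entry (reverse (txn-log txn)))
    (apply apply-fn entry))
  (txn-touched txn))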
A big decision is:
- Use cffi/uffi and do much of the serialization & btree
implementation in C/static memory
IME this FFI stuff can be quite hard to debug.
That's true, but the CFFI operations would basically be the primitive
set that is already used in memutils to implement buffer streams.
I lean towards using cffi to manipulate static data, just because
it's going to be easier to get performance via that method, and it's
also going to be much easier to do a multi-process implementation (by
operating on static memory and primitive locks in a shared memory
region).
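For flavor, the cffi equivalent of what memutils does through uffi,
assuming cffi is loaded:

(let ((buf (cffi:foreign-alloc :unsigned-char :count 16)))
  (unwind-protect
       (progn
         ;; poke a byte directly into the C heap and read it back
         (setf (cffi:mem-aref buf :unsigned-char 0) 42)
         (cffi:mem-aref buf :unsigned-char 0))  ; => 42
    (cffi:foreign-free buf)))

The same mem-aref style primitives would work on a pointer into a
mapped shared-memory region, which is what makes the multi-process
story easier.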
I cannot estimate the performance trade-offs involved here,
but in general I'm in favor of a Lisp-based approach...
I think if we abstract correctly with a more direct approach in mind,
we could always go back and do it the other way later. Probably best
to start with lisp; it just may mean more work in the meantime to
rewrite the functionality in memutils and serializer2.
- the binary pages could be stored in static data
I don't think I understand this. What does "static" mean here
(and above)?
Static data just means data allocated from the C heap rather than by
the lisp garbage collector. Lisp data structures and the primitive
btree ops could directly manipulate data within the page. In BDB we
pass a C function that directly sorts binary sequences rather than
having to deserialize to sort; we'd need to write that in lisp to
operate on static data or on lisp arrays. Deserializing on each key
comparison is too expensive.
Yes.
A nice auxiliary hack would be:
- rewrite memutils to entirely use uffi/cffi to manipulate static
data rather than calling out to C to do it. Maintains efficiency but
removes the compilation build step except for users of BDB.
I have looked at memutils/the serializer, but it's very hard for me
to replace it with something else, because I'm not sure what would be
required of a replacement.
The partial conclusion to which I came was that memutils just models
a bivalent stream so the backend can communicate with the serializer.
That's essentially correct, although memutils was written to implement
buffer streams which are then used by the serializer - in effect it's
the serializer that uses memutils to send serialized buffers to the
data stores. It was originally designed for BDB and has the benefit
of avoiding an extra copy step when talking to the BDB API. If you
are serializing structures into a lisp array, to pass them to C you
need to copy them to a C array, and BDB then copies that array into
the appropriate cached page and ultimately writes that page to disk.
Using memutils as-is is actually more expensive for a lisp-only
solution if we're storing pages in lisp arrays (which hopefully will
quickly become tenured and stop being copied around by the
collector). We would then write into a C array with memutils, copy
into a lisp array and write that to disk rather than writing directly
to the lisp array.
The data format is hard to figure out for me, however, because of
all the UFFI stuff involved...
The serializer is probably fine as it is from a functional standpoint,
but to do it in lisp we'll need to change out all the primitives it
uses (buffer-write-int32, buffer-write-byte, etc).
Buffer streams are wrappers around C arrays:

(defstruct buffer-stream
  "A stream-like interface to foreign (alien) char buffers."
  (buffer (allocate-foreign-object :unsigned-char 10)
          :type array-or-pointer-char)
  (size 0 :type fixnum)     ; bytes of valid data (the write pointer)
  (position 0 :type fixnum) ; the current read pointer
  (length 10 :type fixnum)) ; total size of the allocated region
We allocate that object directly from the C heap (just like malloc)
using the allocate-foreign-object call. This is currently done
through uffi; using the uffi compat layer from cffi leads to errors.
Size is the current size of valid data in the array (a write
pointer). The position is the current read pointer and the length is
the total size of the allocated region.
If I write two int32's into the stream the values above are:
size 8, position 0, length 10
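For example, a pure-lisp stand-in for buffer-write-int32 over an
octet array could look like this (illustrative only; endianness and
the struct bookkeeping have to match the existing serializer):

(defun lisp-buffer-write-int32 (value buffer size)
  "Write VALUE as 4 little-endian octets at offset SIZE in BUFFER;
return the advanced write pointer."
  (dotimes (i 4)
    (setf (aref buffer (+ size i)) (ldb (byte 8 (* 8 i)) value)))
  (+ size 4))

Writing two int32's with this advances size from 0 to 8, just as in
the numbers above.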
If the BTree node binary data is stored in a lisp octet array, and we
don't want to deserialize to compare keys, then we need to write the
procedure that you can find in libberkeley-db.c in the src/db-bdb
directory. It performs a content-sensitive comparison on a
byte-by-byte basis. This has the huge benefit of allowing us to do
key comparisons 'in-place' without creating lisp objects. We pass
this C function to BDB, which uses it to compare keys and values
directly in the binary file pages.
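A simplified lisp stand-in for that comparator (raw lexicographic
order only; the real C version is content-sensitive, i.e. it
understands the serialized type tags):

(defun compare-octets (a a-start a-end b b-start b-end)
  "memcmp-style comparison: negative, zero, or positive result."
  (loop for i from a-start below a-end
        for j from b-start below b-end
        for diff = (- (aref a i) (aref b j))
        unless (zerop diff) return diff
        ;; equal over the common prefix: the shorter field sorts first
        finally (return (- (- a-end a-start) (- b-end b-start)))))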
I tell you what, if we wrap an abstraction around a lisp paged btree
approach, we can cheat with memutils for the time being and replace it
later. Every field written to a binary page should have an extent
associated with it. To compare keys, we copy that extent to a buffer
stream, run the comparison operation in the existing memutils, etc.
To serialize/deserialize, we just copy to/from the lisp binary page
and the buffer stream abstraction.
To look up data in a page you would do something like:
serialize the key to a buffer stream (bs1)
database file => read a fixed-size page into an array
page => copy the serialized key to a buffer stream (bs2)
compare bs1 and bs2
look at the btree page to decide which key to compare next, or
copy the value to a buffer stream (bs3) and run the deserializer.
Later this should be something like:
serialize the key to a lisp array
database file => read a fixed-size page into an array
byte-by-byte comparison of the key in the lisp array to a field at
offset + length in the binary page
on success, deserialize from the offset + length field in the binary
page.
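With a comparator like compare-octets above, that comparison step
reduces to something like (again a sketch):

(defun key-matches-field-p (key page offset len)
  "Compare the serialized KEY octets against the field of length LEN
at OFFSET in the binary PAGE, with no deserialization."
  (zerop (compare-octets key 0 (length key)
                         page offset (+ offset len))))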
I'm sorry if this is confusing. I have some code dealing with fixed
pools of binary pages and some attempts at a btree and log
implementation in src/contrib/db-lisp. Rucksack is also an excellent
source of ideas. The Rucksack persistence model is too different from
Elephant's for me to be comfortable adapting it, but it has a much
more elegant serializer and an all-in-lisp implementation of btrees
(as I recall). That is another great place to start getting ideas.
Ian
Leslie
_______________________________________________
elephant-devel site list
elephant-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/elephant-devel