I am certainly in favor of using fixed-size pages as a basic model. I
think this is very common, and it has the advantage of making page-based
caching easier. One can even contemplate things like RAID-style
striping across machines on a LAN or WAN.
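To make that concrete, here is roughly the page abstraction I have in
mind; the names and sizes are just placeholders, not a proposal for the
actual code:

  ;; Minimal sketch of fixed-size pages plus a page cache keyed by
  ;; page number.  All names here are hypothetical.
  (defconstant +page-size+ 4096)

  (defstruct page
    (number 0 :type fixnum)
    (dirty-p nil :type boolean)
    (bytes (make-array +page-size+ :element-type '(unsigned-byte 8))
           :type (simple-array (unsigned-byte 8) (*))))

  (defvar *page-cache* (make-hash-table)
    "Pages currently held in memory, keyed by page number.")

  (defun read-page (stream page-number)
    "Fetch a page from the cache, loading it from the binary STREAM
  on a miss.  STREAM must be opened with element-type (unsigned-byte 8)."
    (or (gethash page-number *page-cache*)
        (let ((page (make-page :number page-number)))
          (file-position stream (* page-number +page-size+))
          (read-sequence (page-bytes page) stream)
          (setf (gethash page-number *page-cache*) page))))

With something like this, striping would just mean mapping page numbers
onto different files or hosts.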
I think we should create a really pure-lisp solution that doesn't depend
on C in any way. Certainly we should do that first.
However, I have a personal reason for preferring a completely LISP-based
serializer: I think there is a lot of opportunity for major performance
enhancements by making the serializer more efficient, by which I mean
producing very compressed representations of the serialized data. For
example, if we could produce a 10-fold reduction in the size of the
compressed data, we would have a game-changing improvement, which would
probably make us much faster than BDB or any relational system. If that
is true, then almost any amount of CPU time spent in the codec is
warranted.
A 10-fold improvement may seem unlikely until one stops to consider the
applicability of Huffman coding of English strings, prefix-forward-index
encoding of ordered strings, Huffman encoding on our little
type-designators, the possibility of keeping a dictionary in a cached
page for LZW-style compression, the possibility of detecting a Poisson
distribution of integer values by declaration or statistics, etc.
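To give one concrete example of what I mean by prefix-forward-index
encoding (front coding), a toy version over an already-sorted list of
strings could be as simple as this; it is purely an illustration, not a
proposed format:

  ;; Front coding: each string is stored as (shared-prefix-length . suffix)
  ;; relative to its predecessor in sorted order.
  (defun front-code (sorted-strings)
    "Encode each string as (SHARED . SUFFIX) relative to the previous one."
    (loop for prev = "" then s
          for s in sorted-strings
          for shared = (or (mismatch prev s) (length prev))
          collect (cons shared (subseq s shared))))

  (defun front-decode (entries)
    "Invert FRONT-CODE."
    (loop for prev = "" then s
          for (shared . suffix) in entries
          for s = (concatenate 'string (subseq prev 0 shared) suffix)
          collect s))

For example, (front-code '("apple" "applet" "apply" "banana")) gives
((0 . "apple") (5 . "t") (4 . "y") (0 . "banana")), which is already a
large saving on index pages full of similar keys.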
However, I think this is a very "researchy" line of thought. I would
like to have the ability to explore these ideas by writing them up in
LISP code, even though they definitely would represent premature
optimization until we had a basic B-Tree working.

It's not 100% clear to me that C would offer a more performant solution
in any case. It is probably true that a high-performance serializer
would have to be tuned to various LISP implementations.
On Tue, 2008-05-13 at 15:46 -0400, Ian Eslick wrote:
I think the key decision is what serialization format we're going to
use for btree nodes, log entries, etc., and how that relates to caching
data during transactions, maintaining free lists, etc.

The current serializer produces a byte sequence. If we continue with
that model, how do we write/read this stuff from disk? How and where
do we store it prior to committing a transaction?
When we create a new key or value as a binary stream within a
transaction, how is it stored in memory? If we want a multi-process,
but non-socket-based, approach, we need to figure out how to store data
in shared memory, etc.
For example, in BDB, the primitive is the page. BTree nodes are laid
out in one or more pages, and each page has some binary metadata
explaining its type and organization (free list, etc.). A short key
value is written directly into the page, a long one is written into an
overflow page, etc. There are lots of details to deal with in managing
variable-sized data on disk. Pages that are dirty are kept in memory
(which is why BDB can run out of transaction space; the pages overflow
the max cache size when you are writing lots of data).
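Just to give a flavor of the per-page metadata involved, a header might
look something like the following; this is a made-up layout for
illustration, not BDB's actual format:

  ;; Hypothetical on-disk page layout:
  ;;   bytes 0-3    page type (btree-internal, btree-leaf, overflow, free)
  ;;   bytes 4-7    number of entries on the page
  ;;   bytes 8-11   offset of remaining free space within the page
  ;;   bytes 12-15  next overflow/free page number, or 0
  ;;   bytes 16-    packed key/value entries
  (defun page-type (bytes)
    "Decode the little-endian 32-bit type field from a page's byte vector."
    (logior (aref bytes 0)
            (ash (aref bytes 1) 8)
            (ash (aref bytes 2) 16)
            (ash (aref bytes 3) 24)))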
However, to get started, the easiest thing is to reuse the existing
memutils serializer, not worry about multi-process operation and not
worry about fragmentation, sharing space and maintaining free lists
(except perhaps for btree nodes).
Something like:
- btree nodes only keep pointers to variable-sized keys stored
  elsewhere in the file
- new keys and values of the same or lesser length are written in
  place; otherwise new space is allocated at the end of the file
- btree nodes are a fixed-size page on disk and keep some free-list
  information so we can reuse them
- transactions simply keep track of the primitive operations on the
  database and the associated data in a memory queue and write those ops
  to disk as part of the txn commit process. The pages and key/value
  pairs that will be touched in that operation are also stored in that
  txn log.
- when a transaction commits, it replays the log to write everything
  to disk appropriately. The list of touched data is then passed up the
  commit chain to invalidate any pending transactions that have a
  conflict. Everything is speculative in this case, but we don't have
  to deal with locking. (A rough sketch of this follows below.)
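Something like the following is what I'm picturing for the transaction
log; all of the names and the apply-op stub are made up, and the real
thing would obviously need to track touched pages and write an actual
log record to disk:

  ;; Rough sketch of a transaction as a queue of primitive ops that is
  ;; replayed at commit time.
  (defstruct txn
    (log '() :type list)       ; queued ops, newest first
    (touched '() :type list))  ; pages / offsets this txn will write

  (defun log-op (txn op &rest args)
    "Queue a primitive database operation instead of applying it now."
    (push (cons op args) (txn-log txn))
    txn)

  (defun apply-op (op args)
    "Stub: write one primitive op to disk.  A real version would dispatch
  on OP (insert, delete, ...) and update the relevant pages."
    (format t "~&applying ~S ~S~%" op args))

  (defun commit (txn)
    "Replay the queued ops in order, then return the touched set so
  pending transactions with conflicts can be invalidated."
    (dolist (entry (reverse (txn-log txn)))
      (apply-op (car entry) (cdr entry)))
    (txn-touched txn))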
This is a nice balance between some lisp-sexp serialization format
that performs poorly, and a highly-optimized low-level implementation
which is blindingly fast.
A big decision is:
- use cffi/uffi and do much of the serialization & btree
  implementation in C/static memory, or
- do all of this in pools of static arrays and write a new
  serializer to operate on lisp data.
I lean towards using cffi to manipulate static data, just because it's
going to be easier to get performance via that method and it's also
going to be much easier to do a multi-process implementation (by
operating on static memory and primitive locks in a shared memory
region).
Predicated on that decision, getting started on the simplest possible
btree/dup-btree implementation is the next, most valuable and
educational step.
The key pieces for a single-process lisp backend:
- btrees and dup-btrees (indices can be built from these two easily
  enough)
  - the binary pages could be stored in static data and the primitive
    btree ops could directly manipulate data within the page? We pass a
    C function that directly sorts binary sequences rather than having
    to deserialize to sort. We'd need to write that in lisp to operate
    on static data or on lisp arrays. Deserializing on each key
    comparison is too expensive. (A minimal comparison sketch follows
    this list.)
- a set of transaction records (lisp structs and consts)
  - simply keeps tuples of (op {rd | wr} {btree+page-offset |
    value-offset} [values]) in a memory queue. Could use static memory
    for this to reduce load on GC.
- a blocking primitive library that serializes txn commits
  (i.e. write log to disk, write data to disk, write 'commit done' to
  log, invalidate pending/conflicting txns)
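For the key-comparison piece, a byte-wise compare over the serialized
forms might look like the following. Note the assumption: this only
matches the logical sort order if the serializer emits order-preserving
encodings, which the current one does not guarantee for all types.

  ;; Lexicographic comparison of two serialized keys held as
  ;; (unsigned-byte 8) vectors, so btree ops can sort without
  ;; deserializing.
  (defun compare-keys (a b)
    "Return -1, 0, or 1 according to the byte-wise order of A and B."
    (let ((na (length a)) (nb (length b)))
      (dotimes (i (min na nb))
        (let ((x (aref a i)) (y (aref b i)))
          (cond ((< x y) (return-from compare-keys -1))
                ((> x y) (return-from compare-keys 1)))))
      (cond ((< na nb) -1)
            ((> na nb) 1)
            (t 0))))

The same logic would work on static data by swapping aref for a foreign
memory accessor.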
A nice auxiliary hack would be:
- rewrite memutils to entirely use uffi/cffi to manipulate static data
  rather than calling out to C to do it. Maintains efficiency but
  removes the compilation build step except for users of BDB. (See the
  small example below.)
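By that I mean doing from lisp what the C helpers do today, along these
lines; this assumes the cffi system is loaded and the names are just
examples, not the existing memutils API:

  ;; Manipulating static (foreign) memory directly via CFFI instead of
  ;; calling out to C.  Real memutils would also handle endianness,
  ;; growable buffers, etc.
  (defun write-u32 (ptr offset value)
    "Store a 32-bit unsigned integer into foreign memory at byte OFFSET."
    (setf (cffi:mem-ref ptr :uint32 offset) value))

  (defun read-u32 (ptr offset)
    "Read a 32-bit unsigned integer from foreign memory at byte OFFSET."
    (cffi:mem-ref ptr :uint32 offset))

  ;; Example use:
  ;; (cffi:with-foreign-object (buf :uint8 4096)
  ;;   (write-u32 buf 0 #xDEADBEEF)
  ;;   (read-u32 buf 0))   ;=> 3735928559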
So what do people think about the cffi->static-data vs. lisp->array-pool
decision?
Ian
On May 13, 2008, at 2:03 PM, Leslie P. Polzer wrote:
I suppose the "binary paging" approach mentioned in the design
considerations
document means the problem of organizing the data efficiently on
disk
right from the start. Is this correct?
Do you think it would make good sense to start working on the btree
library
without thinking much about on-disk efficiency, leaving this part
for later?
I'm not sure a btree where on-disk storage organization is separeted
from the
rest like that can achieve enough efficiency...
Thanks,
Leslie
_______________________________________________
elephant-devel site list
elephant-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/elephant-devel