> Log structural database  has the append-only characteristics, e.g. BDB-JE.
> Is it an alternative for SSTable? Those matured database product might have
> done a lot for cache management. Not sure whether it can improve the
> performance of read or not.

BDB JE seems to be targeted mostly at cases where data fits in RAM,
or reasonably close to it. A problem is that while writes will be
append-only as long as the database is sufficiently small, you start
taking reads once the internal btree nodes no longer fit in RAM. So
depending on cache size, at a certain number of keys (thus size of the
btree) you start being seek-bound on reads while writing, even though
the writes are in and of themselves append-only and not subject to
seek overhead.
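
As a rough back-of-envelope illustration of the point above, here is a
toy model in Python. It is only a sketch under stated assumptions, not
a description of JE internals: keys are accessed uniformly at random,
each write needs one btree node lookup, a lookup that misses the cache
costs exactly one disk seek, and the function name and the default 8 ms
seek time are made up for the example.

```python
def max_writes_per_sec(btree_bytes: int, cache_bytes: int,
                       seek_ms: float = 8.0) -> float:
    """Upper bound on write rate imposed by btree read seeks (toy model)."""
    if cache_bytes >= btree_bytes:
        # Whole btree is cached: writes stay purely append-only.
        return float("inf")
    # Fraction of node lookups that miss the cache (uniform access assumed).
    miss_ratio = 1.0 - cache_bytes / btree_bytes
    # Assumption: one node lookup per write, one seek per cache miss.
    seeks_per_write = miss_ratio
    return 1000.0 / (seeks_per_write * seek_ms)


if __name__ == "__main__":
    gib = 1024 ** 3
    for btree_gib in (1, 2, 4, 16):
        rate = max_writes_per_sec(btree_gib * gib, 1 * gib)
        print(f"btree {btree_gib:>2} GiB, cache 1 GiB: {rate:.0f} writes/s")
```

The only thing this is meant to show is the shape of the curve: the
bound is infinite while the btree fits in cache and drops towards
1/seek-time as the btree grows past it.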

Another effect, which I have not specifically confirmed in testing but
expect to happen, is that once you reach this point of
taking reads, compaction is probably going to be a lot more expensive.
While normally JE can pick a log segment with the most garbage and
mostly stream through it, re-writing non-garbage, that process will
then also become entirely seek bound if only a small subset of the
btree fits in RAM. So now you have a seek bound compaction process
that must keep up with the append-only write process, meaning that
your append-only writes are limited by said seeks in addition to any
seeks it takes "directly" when generating the writes.
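
The expected compaction cost can be sketched the same way. Again this
is a toy model of the reasoning above and not of JE's actual cleaner:
it assumes every entry in a segment needs one btree lookup, that a
cache miss costs one seek, and the function name, the 100 MB/s
streaming rate and the 8 ms seek time are hypothetical.

```python
def segment_clean_seconds(segment_bytes: int, entries: int,
                          miss_ratio: float,
                          stream_mb_s: float = 100.0,
                          seek_ms: float = 8.0) -> float:
    """Estimated time to clean one log segment (toy model)."""
    # Sequential pass over the segment itself.
    stream_s = segment_bytes / (stream_mb_s * 1e6)
    # Assumption: one btree lookup per entry, one seek per cache miss.
    seek_s = entries * miss_ratio * seek_ms / 1000.0
    return stream_s + seek_s


if __name__ == "__main__":
    for miss in (0.0, 0.1, 0.9):
        t = segment_clean_seconds(100_000_000, 100_000, miss)
        print(f"miss ratio {miss:.1f}: {t:.1f} s per 100 MB segment")
```

With a fully cached btree the streaming term dominates; with even a
modest miss ratio the seek term takes over, which is the "entirely seek
bound" case described above.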

Also keep in mind that JE won't have on-disk locality for either
internal nodes or leaf (data) nodes.

The guaranteed append-only nature of Cassandra, in combination with
the on-disk locality, is one reason to prefer it, under some
circumstances, over JE even for non-clustered local use on a single
machine.

(As an aside: I doubt JE is being used very much with huge
databases, since a very significant CPU bottleneck turned out to be
O(n) (with respect to the number of log segments) file listings. This
is probably easily patched, or configured away by using larger log
segments, but the repeated O(n) file listings suggest to me that huge
databases are not an expected use case, quite apart from some hints in
the documentation that would indicate it's meant for smaller
databases.)

-- 
/ Peter Schuller
