> Log structural database has the append-only characteristics, e.g. BDB-JE.
> Is it an alternative for SSTable? Those matured database product might have
> done a lot for cache management. Not sure whether it can improve the
> performance of read or not.

BDB JE seems to be targeted mostly at cases where the data fits in RAM, or reasonably close to it. A problem is that while writes will be append-only as long as the database is sufficiently small, you start taking reads once the internal btree nodes no longer fit in RAM. So depending on cache size, at a certain number of keys (and thus size of the btree) you start being seek-bound on reads while writing, even though the writes are in and of themselves append-only and not subject to seek overhead.

Another effect, which I have not specifically confirmed in testing but expect to happen, is that once you reach this point of taking reads, compaction is probably going to be a lot more expensive. While normally JE can pick the log segment with the most garbage and mostly stream through it, re-writing non-garbage, that process will then also become entirely seek-bound if only a small subset of the btree fits in RAM. So now you have a seek-bound compaction process that must keep up with the append-only write process, meaning that your append-only writes are limited by said seeks in addition to any seeks taken "directly" when generating the writes.

Also keep in mind that JE won't have on-disk locality for either internal nodes or leaf (data) nodes.

The guaranteed append-only nature of Cassandra, in combination with the on-disk locality, is one reason to prefer it, under some circumstances, over JE even for non-clustered local use on a single machine.

(As a parenthesis: I doubt JE is being used very much with huge databases, since file listings that are O(n) with respect to the number of log segments became a very significant CPU bottleneck. This is probably easily patched, or configured away by using larger log segments, but the repeated O(n) file listings suggest to me that huge databases are not an expected use case - in addition to some hints in the documentation that would indicate it's meant for smaller databases.)

--
/ Peter Schuller
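
To make the seek-bound argument above a bit more concrete, here is a rough back-of-envelope model in Python. It is only a sketch under simplifying assumptions that are not taken from JE itself: keys are accessed uniformly at random, a write costs at most one read seek (when the internal node it needs is not cached), and the cached fraction is simply cache size divided by total internal-node size. The function names and parameters (`internal_bytes_per_key`, `seeks_per_sec`) are made up for illustration, and the numbers in the loop are arbitrary.

```python
def cached_fraction(cache_bytes, num_keys, internal_bytes_per_key):
    """Fraction of the internal btree nodes that fit in cache.

    Assumes uniform random access, so the hit rate equals the cached share.
    """
    internal_bytes = num_keys * internal_bytes_per_key
    if internal_bytes <= 0:
        return 1.0
    return min(1.0, cache_bytes / internal_bytes)


def max_writes_per_sec(cache_bytes, num_keys, internal_bytes_per_key, seeks_per_sec):
    """Upper bound on writes/sec imposed by read seeks on cache misses.

    Returns None while everything fits in cache, i.e. while the writes are
    purely append-only and not limited by seeks in this model.
    """
    miss_rate = 1.0 - cached_fraction(cache_bytes, num_keys, internal_bytes_per_key)
    if miss_rate == 0.0:
        return None
    return seeks_per_sec / miss_rate


if __name__ == "__main__":
    # Hypothetical numbers: 1 GiB cache, 50 bytes of internal node per key,
    # and a disk that manages about 100 seeks per second.
    for num_keys in (10**6, 10**7, 10**8, 10**9):
        bound = max_writes_per_sec(2**30, num_keys, 50, 100)
        print(num_keys, "not seek-bound" if bound is None else round(bound))
```

The model ignores compaction entirely; per the reasoning above, a seek-bound compaction process would lower the achievable write rate further.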