On Thu, Aug 9, 2012 at 10:29 AM, Felix Meschberger <[email protected]> wrote:
> Hi,
>
> Interesting thoughts.
>
> To double on the in-memory assumption of the complete tree, I'd like to add
> that the most dramatic improvement in overall Jackrabbit performance on a
> configuration can probably be reached by increasing the bundle cache size,
> which eventually is more or less what you are proposing.
Yes, that is our experience too. I still want to make the bundle cache
pluggable (hopefully I can find some cycles somewhere to work on it) so you
can plug in a distributed cache. The "problem" with the bundle cache is that
it is local, so each cluster node requires quite a bit of memory to have a
decent cache for large installations. A distributed cache would also "solve"
the cache warming problem on restarts. The downside of course is the extra
infrastructure to set it up and probably a little bit of performance loss:
you can't beat a local JVM heap cache ;-)

But this discussion belongs on the jr-dev list ;-)

Regards,
Bart

> Regards
> Felix
>
> On 07.08.2012 at 16:07, Jukka Zitting wrote:
>
>> Hi,
>>
>> [Just throwing an idea around, no active plans for further work on this.]
>>
>> One of the biggest performance bottlenecks with current repository
>> implementations is disk speed, especially seek times but also raw data
>> transfer rate in many cases. To work around those limitations we've in
>> Jackrabbit used various caching strategies that considerably
>> complicate the codebase and still have trouble with cache misses and
>> write-through performance.
>>
>> As an alternative to such designs, I was thinking of a microkernel
>> implementation that would keep the *entire* tree structure in memory,
>> i.e. only use the disk or another backend for binaries and possibly
>> for periodic backup dumps. Fault tolerance against hardware failures
>> or other restarts would be achieved by requiring a clustered
>> deployment where all content is kept as copies on at least three
>> separate physical servers. Redis (http://redis.io/) is a good example
>> of the potential performance gains of such a design.
>>
>> To estimate how much memory such a model would need, I looked at the
>> average bundle size of a vanilla CQ5 installation. There the average
>> bundle (i.e. a node with all its properties and child node references)
>> size is just 251 bytes. Even assuming larger bundles and some level of
>> storage and index overhead it seems safe to assume up to about 1kB of
>> memory per node on average. That would allow one to store some 1M
>> nodes in each 1GB of memory.
>>
>> Assuming that all content is evenly spread across the cluster in a way
>> that puts copies of each individual bundle on at least three different
>> cluster nodes and that each cluster node additionally keeps a large
>> cache of most frequently accessed content, a large repository with
>> 100+M content nodes could easily run on a twelve-node cluster where
>> each cluster node has 32GB RAM, a reasonable size for a modern server
>> (also available from EC2 as m2.2xlarge). A mid-size repository with
>> 10+M content nodes could run on a three- or four-node cluster with
>> just 16GB RAM per cluster node (or m2.xlarge in EC2).
>>
>> I believe such a microkernel could set a pretty high bar on
>> performance! The only major performance limit I foresee is the network
>> overhead when writing (need to send updates to other cluster nodes)
>> and during cache misses (need to retrieve data from other nodes), but
>> the cache misses would only start affecting repositories that go
>> beyond what fits in memory on a single server (i.e. the mid-size
>> repository described above wouldn't yet be hit by that limit) and the
>> write overhead could be amortized by allowing the nodes to temporarily
>> diverge until they have a chance to sync up again in the background
>> (as allowed by the MK contract).
>>
>> BR,
>>
>> Jukka Zitting

-- 
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142
US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com
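
To make the pluggable bundle cache idea from the reply above a little more
concrete, here is a minimal sketch of what such a plug-in point could look
like. Everything in it is hypothetical: the BundleCache interface, the
InHeapBundleCache and DistributedBundleCache classes and the
DistributedCacheClient placeholder are illustrations, not existing Jackrabbit
APIs. The point is only that a local in-heap cache and a shared distributed
cache could sit behind the same contract.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical SPI for a pluggable bundle cache; not an actual Jackrabbit interface.
    // Bundles are treated as opaque serialized byte arrays keyed by a node identifier.
    interface BundleCache {
        byte[] get(String nodeId);              // return cached bundle, or null on a miss
        void put(String nodeId, byte[] bundle); // cache a (re)loaded or newly written bundle
        void invalidate(String nodeId);         // drop the entry, e.g. after a cluster update
    }

    // Default behaviour: a plain in-heap map, roughly what the current local cache does.
    class InHeapBundleCache implements BundleCache {
        private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
        public byte[] get(String nodeId) { return cache.get(nodeId); }
        public void put(String nodeId, byte[] bundle) { cache.put(nodeId, bundle); }
        public void invalidate(String nodeId) { cache.remove(nodeId); }
    }

    // Alternative behaviour: delegate to a shared/distributed cache (memcached, Redis, ...).
    // DistributedCacheClient is a placeholder for whatever client library would be used.
    class DistributedBundleCache implements BundleCache {
        interface DistributedCacheClient {
            byte[] get(String key);
            void set(String key, byte[] value);
            void delete(String key);
        }
        private final DistributedCacheClient client;
        DistributedBundleCache(DistributedCacheClient client) { this.client = client; }
        public byte[] get(String nodeId) { return client.get(nodeId); }
        public void put(String nodeId, byte[] bundle) { client.set(nodeId, bundle); }
        public void invalidate(String nodeId) { client.delete(nodeId); }
    }

A shared cache behind such a contract trades a network round trip per miss
for a much larger effective cache and for staying warm across restarts of
individual cluster nodes, which is exactly the trade-off described in the
reply above.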
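
The sizing estimate in the quoted mail can also be checked with a quick
back-of-the-envelope calculation. The sketch below simply plugs in the
figures quoted above (roughly 1kB of memory per node including overhead,
each bundle kept on at least three cluster nodes, 32GB or 16GB of RAM per
server); the numbers are assumptions from the mail, not measurements.

    // Back-of-the-envelope check of the in-memory sizing estimate quoted above.
    // Assumed figures (from the mail, not measured here): ~1kB of memory per node
    // including overhead, each bundle replicated on at least three cluster nodes.
    public class MkMemoryEstimate {

        static void estimate(String label, long contentNodes, int clusterNodes, double ramGbPerNode) {
            double bytesPerContentNode = 1_000;  // ~1kB per node incl. storage/index overhead
            int replicas = 3;                    // copies on at least three physical servers

            double totalGb = contentNodes * bytesPerContentNode * replicas / 1e9;
            double gbPerClusterNode = totalGb / clusterNodes;

            System.out.printf("%s: %.0f GB replicated data, %.1f GB per cluster node (of %.0f GB RAM)%n",
                    label, totalGb, gbPerClusterNode, ramGbPerNode);
        }

        public static void main(String[] args) {
            estimate("Large (100M nodes, 12 x 32GB)", 100_000_000L, 12, 32);
            estimate("Mid-size (10M nodes, 4 x 16GB)", 10_000_000L, 4, 16);
        }
    }

With those assumptions the large repository needs about 25GB of the 32GB on
each of the twelve servers, and the mid-size one about 7.5GB of 16GB on each
of four, leaving headroom for the JVM itself and for the per-node cache of
frequently accessed content that the mail relies on.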
