On Wed, Jun 22, 2011 at 2:24 PM, Les Hazlewood <l...@katasoft.com> wrote:
> I'm planning on using Cassandra as a product's core data store, and it
> is imperative that it never goes down or loses data, even in the event
> of a data center failure. This uptime requirement ("five nines":
> 99.999% uptime) w/ WAN capabilities is largely what led me to choose
> Cassandra over other NoSQL products, given its history and 'from the
> ground up' design for such operational benefits.
>
> However, in a recent thread, a user indicated that all 4 of his
> Cassandra instances were down because the OS killed the Java processes
> due to memory starvation, and all 4 instances went down within a
> relatively short period of each other. Another user helped out and
> replied that running 0.8 and nodetool repair on each node regularly
> via a cron job (once a day?) seems to work for him.
>
> Naturally this was disconcerting to read, given our needs for a Highly
> Available product - we'd be royally screwed if this ever happened to
> us. But given Cassandra's history and its current production use, I'm
> aware that this HA/uptime is being achieved today, and I believe it is
> certainly achievable.
>
> So, is there a collective set of guidelines or best practices to
> ensure this problem (or unavailability due to OOM) can be easily
> managed? Things like memory settings, initial GC recommendations, cron
> recommendations, ulimit settings, etc. that can be bundled up as a
> best-practices "Production Kickstart"?
Unfortunately most of these are in the category of "it depends".

-ryan

> Could anyone share their nuggets of wisdom or point me to resources
> where this may already exist?
>
> Thanks!
>
> Best regards,
>
> Les
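
As a rough starting point only, the settings being asked about tend to
look something like the sketch below. This is an illustration, not an
authoritative kickstart: the heap sizes, install paths, the "cassandra"
service user, and the repair schedule are all assumptions that have to
be tuned per deployment, which is exactly the "it depends" problem.

    # conf/cassandra-env.sh: pin the JVM heap explicitly instead of
    # letting the script compute it from system memory. Sizes here
    # assume a 16 GB box; leave ample headroom for the OS page cache
    # so the kernel OOM killer never targets the Cassandra process.
    MAX_HEAP_SIZE="8G"
    HEAP_NEWSIZE="800M"

    # /etc/security/limits.conf: let the cassandra user lock memory
    # (so JNA's mlockall can pin the heap and prevent it from being
    # swapped out) and raise the open-file limit for SSTables/sockets.
    cassandra soft memlock unlimited
    cassandra hard memlock unlimited
    cassandra soft nofile  32768
    cassandra hard nofile  32768

    # Disable swap entirely: a node that swaps is effectively down,
    # and it is better to fail fast and let the replicas serve.
    sudo swapoff -a    # also remove swap entries from /etc/fstab

    # crontab: run anti-entropy repair on each node, staggered so not
    # all nodes repair at once. The hard requirement is only that each
    # node finishes a repair within gc_grace_seconds (10 days by
    # default); the once-a-day schedule mentioned above also works but
    # is more aggressive than necessary. Path is hypothetical.
    0 3 * * 0  /opt/cassandra/bin/nodetool -h localhost repair

The common thread in all of these is keeping the node from swapping or
being OOM-killed in the first place; nodetool repair addresses data
consistency after failures, not the OOM problem itself.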