On Wed, Jun 22, 2011 at 2:24 PM, Les Hazlewood <l...@katasoft.com> wrote:
> I'm planning on using Cassandra as a product's core data store, and it
> is imperative that it never goes down or loses data, even in the event
> of a data center failure. This uptime requirement ("five nines":
> 99.999% uptime) w/ WAN capabilities is largely what led me to choose
> Cassandra over other NoSQL products, given its history and 'from the
> ground up' design for such operational benefits.
>
> However, in a recent thread, a user indicated that all 4 of his
> Cassandra instances were down because the OS killed the Java processes
> due to memory starvation, and all 4 instances went down within a
> relatively short period of each other. Another user helped out and
> replied that running 0.8 and nodetool repair on each node regularly
> via a cron job (once a day?) seems to work for him.
>
> Naturally this was disconcerting to read, given our needs for a Highly
> Available product - we'd be royally screwed if this ever happened to
> us. But given Cassandra's history and its current production use, I'm
> aware that this HA/uptime is being achieved today, and I believe it is
> certainly achievable.
>
> So, is there a collective set of guidelines or best practices to
> ensure this problem (or unavailability due to OOM) can be easily
> managed? Things like memory settings, initial GC recommendations, cron
> recommendations, ulimit settings, etc. that can be bundled up as a
> best-practices "Production Kickstart"?
Unfortunately most of these are in the category of "it depends".

-ryan

> Could anyone share their nuggets of wisdom or point me to resources
> where this may already exist?
>
> Thanks!
>
> Best regards,
>
> Les
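
As a rough starting point only, the settings being asked about tend to
look something like the sketch below. This is an illustration, not an
authoritative kickstart: the heap sizes, install paths, the "cassandra"
service user, and the repair schedule are all assumptions that have to
be tuned per deployment, which is exactly the "it depends" problem.

    # conf/cassandra-env.sh: pin the JVM heap explicitly instead of
    # letting the script compute it from system memory. Sizes here
    # assume a 16 GB box; leave ample headroom for the OS page cache
    # so the kernel OOM killer never targets the Cassandra process.
    MAX_HEAP_SIZE="8G"
    HEAP_NEWSIZE="800M"

    # /etc/security/limits.conf: let the cassandra user lock memory
    # (so JNA's mlockall can pin the heap and prevent it from being
    # swapped out) and raise the open-file limit for SSTables/sockets.
    cassandra soft memlock unlimited
    cassandra hard memlock unlimited
    cassandra soft nofile  32768
    cassandra hard nofile  32768

    # Disable swap entirely: a node that swaps is effectively down,
    # and it is better to fail fast and let the replicas serve.
    sudo swapoff -a    # also remove swap entries from /etc/fstab

    # crontab: run anti-entropy repair on each node, staggered so not
    # all nodes repair at once. The hard requirement is only that each
    # node finishes a repair within gc_grace_seconds (10 days by
    # default); the once-a-day schedule mentioned above also works but
    # is more aggressive than necessary. Path is hypothetical.
    0 3 * * 0  /opt/cassandra/bin/nodetool -h localhost repair

The common thread in all of these is keeping the node from swapping or
being OOM-killed in the first place; nodetool repair addresses data
consistency after failures, not the OOM problem itself.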