Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread David Boxenhorn
I think very high uptime, and very low data loss is achievable in Cassandra, but, for new users there are TONS of gotchas. You really have to know what you're doing, and I doubt that many people acquire that knowledge without making a lot of mistakes. I see above that most people are talking

Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Karl Hiramoto
On 06/23/11 09:43, David Boxenhorn wrote: I think very high uptime, and very low data loss is achievable in Cassandra, but, for new users there are TONS of gotchas. You really have to know what you're doing, and I doubt that many people acquire that knowledge without making a lot of mistakes.

Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Dominic Williams
Les, Cassandra is a good system, but it has not reached version 1.0 yet, nor has HBase etc. It is cutting edge technology and therefore in practice you are unlikely to achieve five nines immediately - even if in theory with perfect planning, perfect administration and so on, this should be

Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Chris Burroughs
On 06/22/2011 10:03 PM, Edward Capriolo wrote: I have not read the original thread concerning the problem you mentioned. One way to avoid OOM is large amounts of RAM :) On a more serious note most OOM's are caused by setting caches or memtables too large. If the OOM was caused by a software

Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Chris Burroughs
On 06/22/2011 07:12 PM, Les Hazlewood wrote: Telling me to read the mailing lists and follow the issue tracker and use monitoring software is all great and fine - and I do all of these things today already - but this is a philosophical recommendation that does not actually address my question.

Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Les Hazlewood
Great stuff Chris - thanks so much for the feedback! Les

Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Les Hazlewood
In the spirit of your re-formulated questions: - Read-before-write is a Cassandra anti-pattern, avoid it if at all possible. This leads me to believe that Cassandra may not be a good idea for a primary OLTP data store. For example only create a user object if email foo is not already in

Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Chris Burroughs
On 06/23/2011 01:56 PM, Les Hazlewood wrote: Is there a roadmap or time to 1.0? Even a ballpark time (e.g. next year 3rd quarter, end of year, etc.) would be great as it would help me understand where it may lie in relation to my production rollout. The C* devs are rather strongly inclined

Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Nate McCall
As an additional concrete detail to Edward's response, 'result pinning' can provide some performance improvements depending on topology and workload. See the conf file comments for details: https://github.com/apache/cassandra/blob/cassandra-0.8.0/conf/cassandra.yaml#L308-315 I would also advise

99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
I'm planning on using Cassandra as a product's core data store, and it is imperative that it never goes down or loses data, even in the event of a data center failure. This uptime requirement (five nines: 99.999% uptime) w/ WAN capabilities is largely what led me to choose Cassandra over other
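For a sense of scale, the arithmetic behind "five nines" can be sketched as below. This is only an illustration of the standard availability calculation, not something posted in the thread; the function name `downtime_budget_seconds` and the 365-day year are assumptions made for the example.

```python
def downtime_budget_seconds(nines: int, period_days: float = 365.0) -> float:
    """Allowed downtime in seconds for an availability of N nines.

    Hypothetical helper: 5 nines means 99.999% availability, so the
    unavailable fraction is 10^-5 of the period.
    """
    unavailability = 10.0 ** (-nines)
    return unavailability * period_days * 24 * 60 * 60


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        seconds = downtime_budget_seconds(n)
        print(f"{n} nines: {seconds / 60:.2f} minutes of downtime per year")
```

Under these assumptions, five nines leaves a little over five minutes of total downtime per year, which is why the replies focus on monitoring and multiple data centers rather than on any single setting.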

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Ryan King
On Wed, Jun 22, 2011 at 2:24 PM, Les Hazlewood l...@katasoft.com wrote: I'm planning on using Cassandra as a product's core data store, and it is imperative that it never goes down or loses data, even in the event of a data center failure. This uptime requirement (five nines: 99.999% uptime)

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
Just to be clear: I understand that resources like [1] and [2] exist, and I've read them. I'm just wondering if there are any 'gotchas' that might be missing from that documentation that should be considered and if there are any recommendations in addition to these documents. Thanks, Les [1]

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
I understand that every environment is different and it always 'depends' :) But recommending settings and techniques based on an existing real production environment (like the user's suggestion to run nodetool repair as a regular cron job) is always a better starting point for a new Cassandra

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Sasha Dolgy
Implement monitoring and be proactive...that will stop you waking up to a big surprise. I'm sure there were symptoms leading up to all 4 nodes going down. Willing to wager that each node went down at different times and not all went down at once... On Jun 22, 2011 11:50 PM, Les Hazlewood

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Will Oberman
Sadly, they all went down within minutes of each other. Sent from my iPhone On Jun 22, 2011, at 6:16 PM, Sasha Dolgy sdo...@gmail.com wrote: Implement monitoring and be proactive...that will stop you waking up to a big surprise. I'm sure there were symptoms leading up to all 4 nodes going

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Chris Burroughs
On 06/22/2011 05:33 PM, Les Hazlewood wrote: Just to be clear: I understand that resources like [1] and [2] exist, and I've read them. I'm just wondering if there are any 'gotchas' that might be missing from that documentation that should be considered and if there are any recommendations

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Edward Capriolo
Committing to that many 9s is going to be impossible since, as far as I know, no internet service provider will SLA you more than 2 9s. You cannot have more uptime than your ISP. On Wednesday, June 22, 2011, Chris Burroughs chris.burrou...@gmail.com wrote: On 06/22/2011 05:33 PM, Les Hazlewood

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Peter Lin
you have to use multiple data centers to really deliver 4 or 5 9's of service On Wed, Jun 22, 2011 at 7:09 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Committing to that many 9s is going to be impossible since, as far as I know, no internet service provider will SLA you more than 2 9s. You
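The two claims in this exchange (a service cannot beat the uptime of what it depends on, and multiple data centers are needed for 4 or 5 nines) correspond to the textbook series and parallel availability formulas. The sketch below is a rough illustration only: the function names are hypothetical, and it assumes failures are statistically independent, which real data centers and ISPs only approximate.

```python
def series_availability(*parts: float) -> float:
    """Availability of a chain where every component must be up.

    Assumes independent failures; the result is never higher than the
    weakest component.
    """
    result = 1.0
    for availability in parts:
        result *= availability
    return result


def parallel_availability(single: float, copies: int) -> float:
    """Availability when any one of `copies` redundant sites is enough.

    Assumes independent failures, which is optimistic for sites that
    share software, operators, or upstream providers.
    """
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    # A five-nines cluster behind a single two-nines link.
    print(f"series:   {series_availability(0.99999, 0.99):.6f}")
    # Two and three independent two-nines sites.
    print(f"2 sites:  {parallel_availability(0.99, 2):.6f}")
    print(f"3 sites:  {parallel_availability(0.99, 3):.6f}")
```

Under the independence assumption, a single two-nines dependency caps the whole chain below two nines, while three independent two-nines sites would reach roughly six nines on paper. Correlated failures, such as the near-simultaneous node outages described elsewhere in this thread, make the real figure lower.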

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
[1] http://www.datastax.com/docs/0.8/operations/index [2] http://wiki.apache.org/cassandra/Operations Well if they knew some secret gotcha the dutiful cassandra operators of the world would update the wiki. As I am new to the Cassandra community, I don't know how 'dutifully' this is

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
On Wed, Jun 22, 2011 at 4:11 PM, Peter Lin wool...@gmail.com wrote: you have to use multiple data centers to really deliver 4 or 5 9's of service We do, hence my question, as well as my choice of Cassandra :) Best, Les

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread mcasandra
In my opinion 5 9s don't matter. It's the number of impacted customers. You might be down during peak for 5 minutes causing 1000s of customer turn-aways while you might be down during the night causing only a few customer turn-aways. There is no magic bullet. It's all about learning and improving. You will

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Peter Lin
so having multiple data centers is step 1 of 4/5 9's. I've worked on some services that had 3-4 9's SLA. Getting there is really tough as others have stated. You have to have auditing built into your service, capacity metrics, capacity planning, some kind of real-time monitoring, staff to respond to

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
Forget the 5 9's - I apologize for even writing that. It was my shorthand way of saying 'this can never go down'. I'm not asking for philosophical advice - I've been doing large scale enterprise deployments for over 10 years. I 'get' the 'it depends' and 'do your homework' philosophy. All I'm

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread mcasandra
Start with reading comments on cassandra.yaml and http://wiki.apache.org/cassandra/Operations As far as I know there is no comprehensive list for performance tuning. More specifically, common settings applicable to everyone. For the most part issues revolve

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
I have architected, built and been responsible for systems that support 4-5 9s for years. This discussion is not about how to do that generally. It was intended to be about concrete techniques that have been found valuable when deploying Cassandra in HA environments beyond what is documented in

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
Yep, that was [2] on my existing list. Thanks very much for actually addressing my question - it is greatly appreciated! If anyone else has examples they'd like to share (like their own cron techniques, or JVM settings and why, etc), I'd love to hear them! Best regards, Les On Wed, Jun 22,

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread mcasandra
Les Hazlewood wrote: I have architected, built and been responsible for systems that support 4-5 9s for years. So have most of us. But probably by now it should be clear that no technology can provide concrete recommendations. They can only provide what might be helpful which varies from

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
On Wed, Jun 22, 2011 at 4:35 PM, mcasandra mohitanch...@gmail.com wrote: might be helpful which varies from env to env. That's why I suggest you look at the comments in cassandra.yaml and see which are applicable in your scenario. I learn something new every time I read it. Yep, and this was

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread C. Scott Andreas
Hi Les, I wanted to offer a couple thoughts on where to start and strategies for approaching development and deployment with reliability in mind. One way that we've found to more productively think about the reliability of our data tier is to focus our thoughts away from a concept of uptime or

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Thoku Hansen
I think that Les's question was reasonable. Why *not* ask the community for the 'gotchas'? Whether the info is already documented or not, it could be an opportunity to improve the documentation based on users' perception. The 'you just have to learn' responses are fair also, but that reminds me

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
Hi Scott, First, let me say that this email was amazing - I'm always appreciative of the time that anyone puts into mailing list replies, especially ones as thorough, well-thought and articulated as this one. I'm a firm believer that these types of replies reflect a strong and durable

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
Hi Thoku, You were able to more concisely represent my intentions (and their reasoning) in this thread than I was able to do so myself. Thanks! On Wed, Jun 22, 2011 at 5:14 PM, Thoku Hansen tho...@gmail.com wrote: I think that Les's question was reasonable. Why *not* ask the community for

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Edward Capriolo
On Wed, Jun 22, 2011 at 8:31 PM, Les Hazlewood l...@katasoft.com wrote: Hi Thoku, You were able to more concisely represent my intentions (and their reasoning) in this thread than I was able to do so myself. Thanks! On Wed, Jun 22, 2011 at 5:14 PM, Thoku Hansen tho...@gmail.com wrote: I

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
Edward, Thank you so much for this reply - this is great stuff, and I really appreciate it. You'll be happy to know that I've already pre-ordered your book. I'm looking forward to it! (When is the ship date?) Best regards, Les On Wed, Jun 22, 2011 at 7:03 PM, Edward Capriolo