RE: [EXTERNAL] Re: Running and Managing Large Cassandra Clusters

2020-10-29 Thread Gediminas Blazys
Hey,

@Tom I just wanted to clarify: when I mentioned workload in my last email, I
meant the amount of requests the cluster has to serve.

Gediminas


Re: Running and Managing Large Cassandra Clusters

2020-10-28 Thread Tom van der Woerdt
That particular cluster exists for archival purposes, and as such gets a very
low volume of traffic (maybe 5 queries per minute), so it's not particularly
helpful for answering your question :-) With that said, we've seen in other
clusters that scalability issues are much more likely to come from hot
partitions, hardware change rate (basically any change to the token ring,
which we never make concurrently), repairs (though largely mitigated now that
we've switched to num_tokens=16), and connection count (sometimes I'd consider
it advisable to configure drivers *not* to establish a connection to every
node, but to bound the connection count and let the Cassandra coordinators
route requests instead).
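
To illustrate that last point, here's a minimal sketch of bounding the
connection count with the DataStax Python driver (cassandra-driver); the
coordinator IPs and keyspace are placeholders, and a whitelist policy is just
one way to do it:

    # Sketch only: connect to a small, fixed set of coordinators instead of
    # opening connections to every node in the cluster. IPs and keyspace
    # below are placeholders.
    from cassandra.cluster import Cluster
    from cassandra.policies import WhiteListRoundRobinPolicy

    coordinators = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical subset

    cluster = Cluster(
        contact_points=coordinators,
        # Restrict the driver to the whitelisted hosts; they act as
        # coordinators and route requests to the rest of the ring.
        load_balancing_policy=WhiteListRoundRobinPolicy(coordinators),
    )
    session = cluster.connect("my_keyspace")  # placeholder keyspace

The trade-off is that you lose token-aware routing, so the whitelisted hosts
do an extra hop to the actual replicas for most requests.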

Scalability in terms of client requests/reads/writes tends to be pretty
linear with node count (and node size, of course); we see this on slightly
smaller clusters as well, which easily do hundreds of thousands to a million
queries per second.

As for repairs, we have our own tools for this, but it's fairly similar to
what Reaper does: we take all the ranges in the cluster and then schedule
them to be repaired over the course of a week. No manual `nodetool repair`
invocations, but specific single-range repairs.
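
For a rough idea of what that looks like, here's a sketch (not our actual
tooling): walk the ring from the driver's metadata and repair one range at a
time against the node that owns it. The contact point and keyspace are
placeholders, and spreading the ranges over a week as well as the wrap-around
range are left out for brevity.

    # Sketch of single-range repairs: walk the ring via driver metadata and
    # repair one (start, end] range at a time against its primary replica.
    # Assumes remote JMX access for nodetool; contact point and keyspace are
    # placeholders. Weekly scheduling and the wrap-around range are omitted.
    import subprocess
    from cassandra.cluster import Cluster

    KEYSPACE = "my_keyspace"  # placeholder

    cluster = Cluster(["10.0.0.1"])  # placeholder contact point
    cluster.connect()  # populates cluster.metadata

    token_map = cluster.metadata.token_map
    ring = list(token_map.ring)  # tokens in ring order
    for start, end in zip(ring, ring[1:]):
        owner = token_map.token_to_host_owner[end]  # owns (start, end]
        subprocess.run(
            ["nodetool", "-h", owner.address, "repair", "-full",
             "-st", str(start.value), "-et", str(end.value), KEYSPACE],
            check=True,
        )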

Tom van der Woerdt
Senior Site Reliability Engineer

Booking.com BV



RE: [EXTERNAL] Re: Running and Managing Large Cassandra Clusters

2020-10-28 Thread Gediminas Blazys
Hey,

Thanks for chipping in, Tom. Could you describe what sort of workload the big
cluster is receiving in terms of local C* reads, writes, and client requests
as well?

You mention repairs, how do you run them?

Gediminas


Re: Running and Managing Large Cassandra Clusters

2020-10-28 Thread Tom van der Woerdt
Heya,

We're running version 3.11.7, can't use 3.11.8 as it won't even start
(CASSANDRA-16091). Our policy is to use LCS for everything unless there's a
good argument for a different compaction strategy (I don't think we have
*any* STCS at all other than system keyspaces). Since our nodes are mostly
on-prem they are generally oversized on cpu count, but when idle the
cluster with 360 nodes ends up using less than two cores *peak* for
background tasks like (full, weekly) repairs and tombstone compactions.
That said they do get 32 logical threads because that's what the hardware
ships with (-:
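
For reference, pinning a table to LCS is a one-line schema change; the
keyspace, table, and contact point below are placeholders:

    # Sketch: switch a table to LeveledCompactionStrategy.
    # Keyspace, table, and contact point are placeholders.
    from cassandra.cluster import Cluster

    cluster = Cluster(["10.0.0.1"])
    session = cluster.connect()
    session.execute("""
        ALTER TABLE my_keyspace.my_table
        WITH compaction = {'class': 'LeveledCompactionStrategy',
                           'sstable_size_in_mb': '160'}
    """)

Existing SSTables then get recompacted into levels in the background, which
is worth keeping in mind on dense nodes.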

Haven't had major problems with Gossip over the years. I think we've had to
run nodetool assassinate exactly once, a few years ago. Probably the only
gossip related annoyance is that when you decommission all seed nodes
Cassandra will happily run a single core at 100% trying to connect until
you update the list of seeds, but that's really minor.
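
The fix is simply rolling the seed list. As a minimal sketch, assuming the
stock SimpleSeedProvider layout in cassandra.yaml (the path and IPs are
placeholders, and a plain YAML dump drops comments, so real automation would
more likely template the file):

    # Sketch: point cassandra.yaml at a new seed list. Assumes the stock
    # SimpleSeedProvider layout; path and IPs are placeholders. Note that
    # yaml.safe_dump discards comments and ordering.
    import yaml

    NEW_SEEDS = ["10.0.0.10", "10.0.0.11"]  # placeholder replacement seeds

    with open("/etc/cassandra/cassandra.yaml") as f:
        conf = yaml.safe_load(f)

    conf["seed_provider"][0]["parameters"][0]["seeds"] = ",".join(NEW_SEEDS)

    with open("/etc/cassandra/cassandra.yaml", "w") as f:
        yaml.safe_dump(conf, f, default_flow_style=False)

    # Seeds are only read at startup, so the change takes effect after a
    # rolling restart of the affected nodes.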

There's also one cluster that has 50TB nodes, 60 of them, storing
reasonably large cells (using LCS, previously TWCS, both fine). Replacing a
node takes a few days, but other than that it's not particularly
problematic.

In my experience it's the small clusters that wake you up ;-)

Tom van der Woerdt
Senior Site Reliability Engineer

Booking.com BV




Re: Running and Managing Large Cassandra Clusters

2020-10-28 Thread Joshua McKenzie
A few questions for you, Tom, if you have 30 seconds and care to disclose:

   1. What version of C*?
   2. What compaction strategy?
   3. What's the core count allocated per C* node?
   4. Does Gossip give you any headaches / do you have to be delicate there,
   or does it behave itself?

Context: I'm a PMC member/committer and I manage the OSS C* team at DataStax.
We're doing a lot of thinking about how to broadly improve the operator
experience for folks in the post-4.0 time frame, so data like the above
(where things are going well at scale and why) is super useful to feed into
that effort.

Thanks!





Re: Running and Managing Large Cassandra Clusters

2020-10-28 Thread Tom van der Woerdt
Does 360 count? :-)

num_tokens is 16 and works fine (we had 256 on a 300-node cluster as well,
without too many problems). Roughly 2.5 TB per node, running on-prem on
reasonably stable hardware, so replacements end up happening once a week at
most, and there's no particular change needed in the automation. Scaling up
or down takes a while, but it doesn't appear to be slower than on any other
cluster. Configuration-wise it's no different from a 5-node cluster either.
Pretty uneventful tbh.
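
A quick way to sanity-check the token count per node is to look at the ring
as the driver sees it; the contact point below is a placeholder:

    # Sketch: count tokens per host from the driver's view of the ring,
    # to confirm every node carries num_tokens (e.g. 16) tokens.
    from collections import Counter
    from cassandra.cluster import Cluster

    cluster = Cluster(["10.0.0.1"])  # placeholder contact point
    cluster.connect()

    tokens_per_host = Counter(
        host.address
        for host in cluster.metadata.token_map.token_to_host_owner.values()
    )
    for address, count in sorted(tokens_per_host.items()):
        print(address, count)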

Tom van der Woerdt
Senior Site Reliability Engineer

Booking.com BV


On Wed, Oct 28, 2020 at 8:58 AM Gediminas Blazys wrote:

> Hello,
>
>
>
> I wanted to seek out your opinion and experience.
>
>
>
> Has anyone of you had a chance to run a Cassandra cluster of more than 350
> nodes?
>
> What are the major configuration considerations that you had to focus on?
> What number of vnodes did you use?
>
> Once the cluster was up and running what would you have done differently?
>
> Perhaps it would be more manageable to run multiple smaller clusters? Did
> you try this approach? What were the major challenges?
>
>
>
> I don’t know if questions like that are allowed here but I’m really
> interested in what other folks ran into while running massive operations.
>
>
>
> Gediminas
>
>
>