Re: Is it possible to bootstrap the 1st node of a new DC?

2015-09-10 Thread horschi
Hi Samuel,

thanks a lot for the jira link. Another reason to upgrade to 2.1 :-)

regards,
Christian



On Thu, Sep 10, 2015 at 1:28 PM, Samuel CARRIERE 
wrote:

> Hi Christian,
> The problem you mention (violation of consistency) is a real one. If I have
> understood correctly, it is resolved in Cassandra 2.1 (see CASSANDRA-2434).
> Regards,
> Samuel
>
>
> horschi  wrote on 10/09/2015 12:41:41:
>
> > From: horschi 
> > To: user@cassandra.apache.org,
> > Date: 10/09/2015 12:42
> > Subject: Re: Is it possible to bootstrap the 1st node of a new DC?
> >
> > Hi Rob,
> >
> > regarding 1-3:
> > Thank you for the step-by-step explanation :-) My mistake was to use
> > join_ring=false during the initial start already. It now works for me
> > as it's supposed to. Nevertheless it does not do what I want, as it does
> > not take writes during the time of repair/rebuild: running an 8-hour
> > repair will lead to 8 hours of data missing.
> >
> > regarding 1-6:
> > This is what we did. And it works of course. Our issue was just that
> > we had some global-QUORUMS hidden somewhere, which the operator was
> > not aware of. Therefore it would have been nice if the ops guy could
> > prevent these reads by himself.
> >
> >
> > Another issue I think the current bootstrapping process has: Doesn't
> > it practically reduce the RF for old data by one? (With old data I
> > mean any data that was written before the bootstrap).
> >
> > Let me give an example:
> >
> > Let's assume I have a cluster of nodes 1, 2 and 3 with RF=3. And let's
> > assume a single write on node 2 got lost, so this particular write
> > is only available on nodes 1 and 3.
> >
> > Now I add node 4, which takes the range in such a way that node 1
> > will not own that previously written key any more. Also assume that
> > the new node loads its data from node 2.
> >
> > This means we have a cluster where the previously mentioned write is
> > only on node 3. (Node 1 is not responsible for the key any more and
> > node 4 loaded its data from the wrong node)
> >
> > Any quorum read that hits nodes 2 & 4 will not return the column. So
> > this means we effectively lowered the CL/RF.
> >
> > Therefore what I would like to be able to do is:
> > - Add new node 4, but leave it in a joining state. (This means it
> > gets all the writes but does not serve reads.)
> > - Do "nodetool rebuild"
> > - New node should not serve reads yet. And node 1 should not yet
> > give up its ranges to node 4.
> > - Do "nodetool repair", to ensure consistency.
> > - Finish bootstrap. Now node1 should not be responsible for the
> > range and node4 should become eligible for reads.
> >
> > regards,
> > Christian
> >
> > On Tue, Sep 8, 2015 at 11:51 PM, Robert Coli 
> wrote:
> > On Tue, Sep 8, 2015 at 2:39 PM, horschi  wrote:
> > I tried to set up a new node with join_ring=false once. In my test
> > that node did not pick a token in the ring. I assume running repair
> > or rebuild would not do anything in that case: No tokens = no data.
> > But I must admit: I have not tried running rebuild.
> >
> > I admit I haven't been following this thread closely, perhaps I have
> > missed what exactly it is you're trying to do.
> >
> > It's possible you'd need to :
> >
> > 1) join the node with auto_bootstrap=false
> > 2) immediately stop it
> > 3) re-start it with join_ring=false
> >
> > To actually use repair or rebuild in this way.
> >
> > However, if your goal is to create a new data-center and rebuild a
> > node there without any risk of reading from that node while creating
> > the new data center, you can just :
> >
> > 1) create nodes in new data-center, with RF=0 for that DC
> > 2) change RF in that DC
> > 3) run rebuild on new data-center nodes
> > 4) while doing so, don't talk to new data-center coordinators from your
> client
> > 5) and also use LOCAL_ONE/LOCAL_QUORUM to avoid cross-data-center
> > reads from your client
> > 6) modulo the handful of current bugs which make 5) currently imperfect
> >
> > What problem are you encountering with this procedure? If it's this ...
> >
> > I've learned from experience that the node immediately joins the
> > cluster, and starts accepting reads (from other DCs) for the range it
> owns.
> >
> > This seems to be the incorrect assumption at the heart of the
> > confusion. You "should" be able to prevent this behavior entirely
> > via correct use of ConsistencyLevel and client configuration.
> >
> > In an ideal world, I'd write a detailed blog post explaining this...
> > :/ in my copious spare time...
> >
> > =Rob
> >
>


Re: High CPU usage on some of nodes

2015-09-10 Thread Robert Wille
It sounds like it's probably GC. Grep for GC in system.log to verify. If it is 
GC, there are a myriad of issues that could cause it, but at least you’ve 
narrowed it down.
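
For example, something along these lines (log path assumes a default package
install; adjust to your environment):

  grep GCInspector /var/log/cassandra/system.log | tail -50

GCInspector is the class that logs the "GC for ParNew: ..." / "GC for
ConcurrentMarkSweep: ..." lines, so long pauses will show up there.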

On Sep 9, 2015, at 11:05 PM, Roman Tkachenko  wrote:

> Hey guys,
> 
> We've been having issues in the past couple of days with CPU usage / load 
> average suddenly skyrocketing on some nodes of the cluster, affecting 
> performance significantly, so the majority of requests start timing out. It can go 
> on for several hours, with CPU spiking through the roof then coming back down 
> to normal and so on. Weirdly, it affects only a subset of nodes and it's always 
> the same ones. The boxes Cassandra is running on are pretty beefy, 24 cores, 
> and these CPU spikes go up to >1000%.
> 
> What is the best way to debug this kind of issue and find out what Cassandra 
> is doing during spikes like this? Doesn't seem to be compaction related as 
> sometimes during these spikes "nodetool compactionstats" says no compactions 
> are running.
> 
> Thanks!
> 



Re: Is it possible to bootstrap the 1st node of a new DC?

2015-09-10 Thread Samuel CARRIERE
Hi Christian,
The problem you mention (violation of consistency) is a real one. If I have 
understood correctly, it is resolved in Cassandra 2.1 (see 
CASSANDRA-2434).
Regards,
Samuel


horschi  wrote on 10/09/2015 12:41:41:

> From: horschi 
> To: user@cassandra.apache.org,
> Date: 10/09/2015 12:42
> Subject: Re: Is it possible to bootstrap the 1st node of a new DC?
> 
> Hi Rob,
> 
> regarding 1-3:
> Thank you for the step-by-step explanation :-) My mistake was to use
> join_ring=false during the initial start already. It now works for me
> as it's supposed to. Nevertheless it does not do what I want, as it does
> not take writes during the time of repair/rebuild: running an 8-hour
> repair will lead to 8 hours of data missing.
> 
> regarding 1-6:
> This is what we did. And it works of course. Our issue was just that
> we had some global-QUORUMS hidden somewhere, which the operator was 
> not aware of. Therefore it would have been nice if the ops guy could
> prevent these reads by himself.
> 
> 
> Another issue I think the current bootstrapping process has: Doesn't
> it practically reduce the RF for old data by one? (With old data I 
> mean any data that was written before the bootstrap).
> 
> Let me give an example:
> 
> Let's assume I have a cluster of nodes 1, 2 and 3 with RF=3. And let's 
> assume a single write on node 2 got lost, so this particular write 
> is only available on nodes 1 and 3.
> 
> Now I add node 4, which takes the range in such a way that node 1 
> will not own that previously written key any more. Also assume that 
> the new node loads its data from node 2.
> 
> This means we have a cluster where the previously mentioned write is
> only on node 3. (Node 1 is not responsible for the key any more and 
> node 4 loaded its data from the wrong node)
> 
> Any quorum read that hits nodes 2 & 4 will not return the column. So 
> this means we effectively lowered the CL/RF.
> 
> Therefore what I would like to be able to do is:
> - Add new node 4, but leave it in a joining state. (This means it 
> gets all the writes but does not serve reads.)
> - Do "nodetool rebuild"
> - New node should not serve reads yet. And node 1 should not yet 
> give up its ranges to node 4.
> - Do "nodetool repair", to ensure consistency.
> - Finish bootstrap. Now node1 should not be responsible for the 
> range and node4 should become eligible for reads.
> 
> regards,
> Christian
> 
> On Tue, Sep 8, 2015 at 11:51 PM, Robert Coli  
wrote:
> On Tue, Sep 8, 2015 at 2:39 PM, horschi  wrote:
> I tried to set up a new node with join_ring=false once. In my test 
> that node did not pick a token in the ring. I assume running repair 
> or rebuild would not do anything in that case: No tokens = no data. 
> But I must admit: I have not tried running rebuild.
> 
> I admit I haven't been following this thread closely, perhaps I have
> missed what exactly it is you're trying to do.
> 
> It's possible you'd need to :
> 
> 1) join the node with auto_bootstrap=false
> 2) immediately stop it
> 3) re-start it with join_ring=false
> 
> To actually use repair or rebuild in this way.
> 
> However, if your goal is to create a new data-center and rebuild a 
> node there without any risk of reading from that node while creating
> the new data center, you can just :
> 
> 1) create nodes in new data-center, with RF=0 for that DC
> 2) change RF in that DC
> 3) run rebuild on new data-center nodes
> 4) while doing so, don't talk to new data-center coordinators from your 
client
> 5) and also use LOCAL_ONE/LOCAL_QUORUM to avoid cross-data-center 
> reads from your client
> 6) modulo the handful of current bugs which make 5) currently imperfect
> 
> What problem are you encountering with this procedure? If it's this ...
> 
> I've learned from experience that the node immediately joins the 
> cluster, and starts accepting reads (from other DCs) for the range it 
owns.
> 
> This seems to be the incorrect assumption at the heart of the 
> confusion. You "should" be able to prevent this behavior entirely 
> via correct use of ConsistencyLevel and client configuration.
> 
> In an ideal world, I'd write a detailed blog post explaining this...
> :/ in my copious spare time...
> 
> =Rob
>  

Re: Is it possible to bootstrap the 1st node of a new DC?

2015-09-10 Thread horschi
Hi Rob,

regarding 1-3:
Thank you for the step-by-step explanation :-) My mistake was to use
join_ring=false during the initial start already. It now works for me as it's
supposed to. Nevertheless it does not do what I want, as it does not take
writes during the time of repair/rebuild: running an 8-hour repair will
lead to 8 hours of data missing.


regarding 1-6:
This is what we did. And it works of course. Our issue was just that we had
some global-QUORUMS hidden somewhere, which the operator was not aware of.
Therefore it would have been nice if the ops guy could prevent these reads
by himself.




Another issue I think the current bootstrapping process has: Doesn't it
practically reduce the RF for old data by one? (With old data I mean any
data that was written before the bootstrap).

Let me give an example:

Let's assume I have a cluster of nodes 1, 2 and 3 with RF=3. And let's assume a
single write on node 2 got lost, so this particular write is only available
on nodes 1 and 3.

Now I add node 4, which takes the range in such a way that node 1 will not
own that previously written key any more. Also assume that the new node
loads its data from node 2.

This means we have a cluster where the previously mentioned write is only
on node 3. (Node 1 is not responsible for the key any more and node 4
loaded its data from the wrong node)

Any quorum read that hits nodes 2 & 4 will not return the column. So this
means we effectively lowered the CL/RF.


Therefore what I would like to be able to do is:
- Add new node 4, but leave it in a joining state. (This means it gets all
the writes but does not serve reads.)
- Do "nodetool rebuild"
- New node should not serve reads yet. And node 1 should not yet give up
its ranges to node 4.
- Do "nodetool repair", to ensure consistency.
- Finish bootstrap. Now node1 should not be responsible for the range and
node4 should become eligible for reads.
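
(For reference, the closest approximation with today's tooling that I can
think of would be roughly the following -- just a sketch, and as noted above
the join_ring=false phase does not receive writes:

  # start the new node with the JVM option -Dcassandra.join_ring=false
  nodetool rebuild <name-of-existing-DC>
  nodetool repair
  nodetool join   # only now join the ring and start serving reads

The DC name is a placeholder.)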


regards,
Christian




On Tue, Sep 8, 2015 at 11:51 PM, Robert Coli  wrote:

> On Tue, Sep 8, 2015 at 2:39 PM, horschi  wrote:
>
>> I tried to set up a new node with join_ring=false once. In my test that
>> node did not pick a token in the ring. I assume running repair or rebuild
>> would not do anything in that case: No tokens = no data. But I must admit:
>> I have not tried running rebuild.
>>
>
> I admit I haven't been following this thread closely, perhaps I have
> missed what exactly it is you're trying to do.
>
> It's possible you'd need to :
>
> 1) join the node with auto_bootstrap=false
> 2) immediately stop it
> 3) re-start it with join_ring=false
>
> To actually use repair or rebuild in this way.
>
> However, if your goal is to create a new data-center and rebuild a node
> there without any risk of reading from that node while creating the new
> data center, you can just :
>
> 1) create nodes in new data-center, with RF=0 for that DC
> 2) change RF in that DC
> 3) run rebuild on new data-center nodes
> 4) while doing so, don't talk to new data-center coordinators from your
> client
> 5) and also use LOCAL_ONE/LOCAL_QUORUM to avoid cross-data-center reads
> from your client
> 6) modulo the handful of current bugs which make 5) currently imperfect
>
> What problem are you encountering with this procedure? If it's this ...
>
> I've learned from experience that the node immediately joins the cluster,
>> and starts accepting reads (from other DCs) for the range it owns.
>
>
> This seems to be the incorrect assumption at the heart of the confusion.
> You "should" be able to prevent this behavior entirely via correct use of
> ConsistencyLevel and client configuration.
>
> In an ideal world, I'd write a detailed blog post explaining this... :/ in
> my copious spare time...
>
> =Rob
>
>
>


Re: High CPU usage on some of nodes

2015-09-10 Thread Samuel CARRIERE
Hi Roman,
If it affects only a subset of nodes and it's always the same ones, it 
could be a "problem" with your data model : maybe some (too) wide rows on 
theses nodes.
If one of your row is too wide, the deserialisation of the columns index 
of this row can take a lot of resources (disk, RAM, and CPU).
If you are using leveled compaction strategy and you see anormaly big 
sstables on thoses nodes, it could be a clue.
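
A rough way to check this on one of the affected nodes (just a sketch; output
names and data paths vary a bit between versions and installations):

  nodetool cfhistograms <keyspace> <table>   # look at the row/partition size percentiles
  ls -lhS /var/lib/cassandra/data/<keyspace>/<table>/*-Data.db | head   # biggest sstables first
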
Regards,
Samuel

Robert Wille  wrote on 10/09/2015 15:27:41:

> From: Robert Wille 
> To: "user@cassandra.apache.org" ,
> Date: 10/09/2015 15:30
> Subject: Re: High CPU usage on some of nodes
> 
> It sounds like it's probably GC. Grep for GC in system.log to verify.
> If it is GC, there are a myriad of issues that could cause it, but 
> at least you've narrowed it down.
> 
> On Sep 9, 2015, at 11:05 PM, Roman Tkachenko  
wrote:
> 
> > Hey guys,
> > 
> > We've been having issues in the past couple of days with CPU usage
> / load average suddenly skyrocketing on some nodes of the cluster, 
> > affecting performance significantly, so the majority of requests start 
> > timing out. It can go on for several hours, with CPU spiking through
> > the roof then coming back down to normal and so on. Weirdly, it 
> affects only a subset of nodes and it's always the same ones. The 
> boxes Cassandra is running on are pretty beefy, 24 cores, and these 
> CPU spikes go up to >1000%.
> > 
> > What is the best way to debug this kind of issue and find out 
> what Cassandra is doing during spikes like this? Doesn't seem to be 
> compaction related as sometimes during these spikes "nodetool 
> compactionstats" says no compactions are running.
> > 
> > Thanks!
> > 
> 


Re: Network / GC / Latency spike

2015-09-10 Thread Alain RODRIGUEZ
Hi, just wanted to drop the follow up here.

I finally figured out that the bigdata guys were basically hammering the cluster
by reading 2 months of data as fast as possible from one table at boot time to
cache it. As this table is storing 12 MB blobs (Bloom Filters), even though the
number of reads was not very high, each row is really big, so reads + read
repairs were putting too much pressure on Cassandra. Those reads were mixed
with much higher workloads, so I was not seeing any burst in reads, making
this harder to troubleshoot. Local read metrics (from Sematext / OpsCenter)
helped us find this out.

Given the use case (no random reads, write once, no update) and the data
size for each element, we will get this out of Cassandra to some HDFS or S3
storage, basically. We do not need any database for this kind of job.
Meanwhile we just disabled this feature as it is not something critical.

@Fabien, Thank you for your help.

C*heers,

Alain

2015-09-02 0:43 GMT+02:00 Fabien Rousseau :

> Hi Alain,
>
> Maybe it's possible to confirm this by testing on a small cluster:
> - create a cluster of 2 nodes (using https://github.com/pcmanus/ccm for
> example)
> - create a fake wide row of a few mb (using the python driver for example)
> - drain and stop one of the two nodes
> - remove the sstables of the stopped node (to provoke inconsistencies)
> - start it again
> - select a small portion of the wide row (many times, use nodetool tpstats
> to know when a read repair has been triggered)
> - nodetool flush (on the previously stopped node)
> - check the size of the sstable (if a few kb, then only the selected slice
> was repaired, but if a few mb then the whole row was repaired)
>
> The wild guess was: if a read repair was triggered when reading a small
> portion of a wide row and if it resulted in streaming the whole wide row,
> it could explain a network burst. (But, on second thought, it makes more
> sense to only repair the small portion being read...)
>
>
>
> 2015-09-01 12:05 GMT+02:00 Alain RODRIGUEZ :
>
>> Hi Fabien, thanks for your help.
>>
>> I did not mention it but I indeed saw a correlation between latency and
>> read repairs spikes. Though this is like going from 5 RR per second to 10
>> per sec cluster wide according to opscenter: http://img42.com/L6gx1
>>
>> I have indeed some wide rows and this explanation looks reasonable to me,
>> I mean this makes sense. Yet isn't this amount of Read Repair too low to
>> induce such a "shitstorm" (even if it spikes x2, I got network x10) ? Also
> >> wide rows are present on heavily used tables (sadly...), so I should be using
> >> more network all the time (why only a few spikes per day, like 2 / 3 max)?
>>
> >> How could I confirm this without removing RR and waiting a week? I mean,
> >> is there a way to see the size of the data being repaired through this
> >> mechanism?
>>
>> C*heers
>>
>> Alain
>>
>> 2015-09-01 0:11 GMT+02:00 Fabien Rousseau :
>>
>>> Hi Alain,
>>>
>>> Could it be wide rows + read repair ? (Let's suppose the "read repair"
>>> repairs the full row, and it may not be subject to stream throughput limit)
>>>
>>> Best Regards
>>> Fabien
>>>
>>> 2015-08-31 15:56 GMT+02:00 Alain RODRIGUEZ :
>>>
 I just realised that I have no idea about how this mailing list handle
 attached files.

 Please find screenshots there --> http://img42.com/collection/y2KxS

 Alain

 2015-08-31 15:48 GMT+02:00 Alain RODRIGUEZ :

> Hi,
>
> Running a 2.0.16 C* on AWS (private VPC, 2 DC).
>
> I am facing an issue on our EU DC where I have a network burst
> (alongside with GC and latency increase).
>
> My first thought was a sudden application burst, though, I see no
> corresponding evolution on reads / write or even CPU.
>
> So I thought that this might come from the nodes themselves, as IN
> almost equals OUT network. I tried lowering stream throughput on the whole
> DC to 1 Mbps, with ~30 nodes --> 30 Mbps --> ~4 MB/s max. My network went a
> lot higher, about 30 M in both directions (see screenshots attached).
>
> I have tried to use iftop to see where this network is headed to, but
> I was not able to do it because the bursts are very short.
>
> So, questions are:
>
> - Has someone experienced something similar already? If so, any clue
> would be appreciated :).
> - How can I know (monitor, capture) where this big amount of network
> is headed to, or what it is due to?
> - Am I right in trying to figure out what this network is, or should I
> follow another lead?
>
> Notes: I also noticed that CPU does not spike, nor does R, but disk
> reads also spike!
>
> C*heers,
>
> Alain
>


>>>
>>
>


Re: Should replica placement change after a topology change?

2015-09-10 Thread Richard Dawe
Hi Robert,

Firstly, thank you very much for your help. I have some comments inline below.

On 10/09/2015 01:26, "Robert Coli" 
> wrote:

On Wed, Sep 9, 2015 at 7:52 AM, Richard Dawe 
> wrote:
I am investigating various topology changes, and their effect on replica 
placement. As far as I can tell, replica placement is not changing after I’ve 
changed the topology and run nodetool repair + cleanup. I followed the 
procedure described at 
http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_switch_snitch.html

That's probably a good thing. I'm going to be modifying the warning in the 
cassandra.yaml to advise users that in practice the only change of snitch or 
replication strategy one can safely do is one in which replica placement does 
not change. It currently says that you need to repair, but there are plenty of 
scenarios where you lose all existing replicas for a given datum, and are 
therefore unable to repair. The key is that you need at least one replica to 
stay the same or repair is worthless. And if you only have one replica staying 
the same, you lose any consistency consistency contract you might have been 
operating under. One ALMOST NEVER ACTUALLY WANTS TO DO ANYTHING BUT A NO-OP 
HERE.

So if you have a topology that would change if you switched from SimpleStrategy 
to NetworkTopologyStrategy plus multiple racks, it sounds like a different 
migration strategy would be needed?

I am imagining:

  1.  Switch to a different snitch, and the keyspace from SimpleStrategy to NTS 
but keep it all in one rack. So effectively the same topology, but with a 
different snitch.
  2.  Set up a new data centre with the desired topology.
  3.  Change the keyspace to have replicas in the new DC.
  4.  Rebuild all the nodes in the new DC.
  5.  Flip all your clients over to the new DC.
  6.  Decommission your original DC.

Or something like that.
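
To make steps 1 and 3 concrete, I imagine the keyspace changes would look
roughly like this in cqlsh (keyspace name, DC names and RF are made up):

  -- step 1: same effective placement, different strategy (everything still in one DC)
  ALTER KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};

  -- step 3: add replicas in the new data centre
  ALTER KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};

  -- step 4 would then be "nodetool rebuild DC1" on each node in the new DC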


Here is my test scenario : 


  1.  To determine the token range ownership, I used “nodetool ring ” 
and “nodetool info -T ”. I saved the output of those commands with 
the original topology, after changing the topology, after repairing, after 
changing the replication strategy, and then again after repairing. In no cases 
did the tokens change. It looks like nodetool ring and nodetool info -T show 
the owner but not the replicas for a particular range.

The tokens and ranges shouldn't be changing, the replica placement should be. 
AFAIK neither of those commands show you replica placement, they show you 
primary range ownership.

Use getendpoints to determine replica placement before and after.
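
For example (keyspace/table/key names are placeholders):

  nodetool getendpoints my_keyspace my_table some_partition_key

Run it for a handful of keys before and after the change and diff the output.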


Thanks, I will play with that when I have a chance next week.


I was expecting the replica placement to change. Because the racks were 
assigned in groups (rather than alternating), I was expecting the original 
replica placement with SimpleStrategy to be non-optimal after switching to 
NetworkTopologyStrategy. E.g.: if some data was replicated to nodes 1, 2 and 3, 
then after the topology change there would be 2 replicas in RAC1, 1 in RAC2 and 
none in RAC3. And hence when the repair ran, it would remove one replica from 
RAC1 and make sure that there was a replica in RAC3.

I would expect this to be the case.

However, when I did a query using cqlsh at consistency QUORUM, I saw that it 
was hitting two replicas in the same rack, and a replica in a different rack. 
This suggests that the replica placement did not change after the topology 
change.

Perhaps you are seeing the quirks of the current rack-aware implementation, 
explicated here?

https://issues.apache.org/jira/browse/CASSANDRA-3810


Thanks. I need to re-read that a few times to understand it.

Is there some way I can see which nodes have a replica for a given token range?

Not for a range, but for a given key with nodetool getendpoints.

I wonder if there would be value to the range... in the pre-vnode past I have 
merely generated a key for each range. With the number of ranges increased so 
dramatically by vnodes, it might be easier to have an endpoint that works on 
ranges...

Thank you again. Best regards, Rich


=Rob



Re: High CPU usage on some of nodes

2015-09-10 Thread Roman Tkachenko
Thanks for the responses guys.

I also suspected GC and I guess it could be it, since during the spikes
logs are filled with messages like "GC for ConcurrentMarkSweep: 5908 ms for
1 collections, 1986282520 used; max is 8375238656", often right before
messages about dropped queries, unlike other, unaffected, nodes that only
have "GC for ParNew: 230 ms for 1 collections, 4418571760 used; max is
8375238656" type of messages.

Is my best shot to play with JVM settings trying to tune garbage collection
then?


On Thu, Sep 10, 2015 at 6:52 AM, Samuel CARRIERE 
wrote:

> Hi Roman,
> If it affects only a subset of nodes and it's always the same ones, it
> could be a "problem" with your data model : maybe some (too) wide rows on
> theses nodes.
> If one of your row is too wide, the deserialisation of the columns index
> of this row can take a lot of resources (disk, RAM, and CPU).
> If you are using leveled compaction strategy and you see anormaly big
> sstables on thoses nodes, it could be a clue.
> Regards,
> Samuel
>
> Robert Wille  wrote on 10/09/2015 15:27:41:
>
> > From: Robert Wille 
> > To: "user@cassandra.apache.org" ,
> > Date: 10/09/2015 15:30
> > Subject: Re: High CPU usage on some of nodes
> >
> > It sounds like it's probably GC. Grep for GC in system.log to verify.
> > If it is GC, there are a myriad of issues that could cause it, but
> > at least you’ve narrowed it down.
> >
> > On Sep 9, 2015, at 11:05 PM, Roman Tkachenko 
> wrote:
> >
> > > Hey guys,
> > >
> > > We've been having issues in the past couple of days with CPU usage
> > / load average suddenly skyrocketing on some nodes of the cluster,
> > affecting performance significantly, so the majority of requests start
> > timing out. It can go on for several hours, with CPU spiking through
> > the roof then coming back down to normal and so on. Weirdly, it
> > affects only a subset of nodes and it's always the same ones. The
> > boxes Cassandra is running on are pretty beefy, 24 cores, and these
> > CPU spikes go up to >1000%.
> > >
> > > What is the best way to debug this kind of issue and find out
> > what Cassandra is doing during spikes like this? Doesn't seem to be
> > compaction related as sometimes during these spikes "nodetool
> > compactionstats" says no compactions are running.
> > >
> > > Thanks!
> > >
> >
>


Re: High CPU usage on some of nodes

2015-09-10 Thread Jeff Jirsa
With a 5s collection, the problem is almost certainly GC. 

GC pressure can be caused by a number of things, including normal read/write 
loads, but ALSO compaction calculation (pre-2.1.9 / #9882) and very large 
partitions (trying to load a very large partition with something like row cache 
in 2.0 and earlier, or issuing a full row read where the row is larger than you 
expect). 

You can try to tune the GC behavior, but the underlying problem may be 
something like a bad data model (which Samuel suggested), and no amount of GC 
tuning is going to fix trying to do bad things with very big rows. 
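
One cheap first check for the very-large-partition case (a sketch; the exact log wording varies by version) is to grep the affected nodes' logs for the large row/partition compaction messages, e.g.:

  grep -i 'large row' /var/log/cassandra/system.log

and to compare nodetool cfhistograms partition sizes between an affected and an unaffected node.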



From:  Roman Tkachenko
Reply-To:  "user@cassandra.apache.org"
Date:  Thursday, September 10, 2015 at 10:54 AM
To:  "user@cassandra.apache.org"
Subject:  Re: High CPU usage on some of nodes

Thanks for the responses guys. 

I also suspected GC and I guess it could be it, since during the spikes logs 
are filled with messages like "GC for ConcurrentMarkSweep: 5908 ms for 1 
collections, 1986282520 used; max is 8375238656", often right before messages 
about dropped queries, unlike other, unaffected, nodes that only have "GC for 
ParNew: 230 ms for 1 collections, 4418571760 used; max is 8375238656" type of 
messages.

Is my best shot to play with JVM settings trying to tune garbage collection 
then?


On Thu, Sep 10, 2015 at 6:52 AM, Samuel CARRIERE  
wrote:
Hi Roman, 
If it affects only a subset of nodes and it's always the same ones, it could be 
a "problem" with your data model : maybe some (too) wide rows on theses nodes.
If one of your row is too wide, the deserialisation of the columns index of 
this row can take a lot of resources (disk, RAM, and CPU).
If you are using leveled compaction strategy and you see anormaly big sstables 
on thoses nodes, it could be a clue.
Regards, 
Samuel 

Robert Wille  wrote on 10/09/2015 15:27:41:

> From: Robert Wille 
> To: "user@cassandra.apache.org" ,
> Date: 10/09/2015 15:30 
> Subject: Re: High CPU usage on some of nodes 
> 
> It sounds like it's probably GC. Grep for GC in system.log to verify.
> If it is GC, there are a myriad of issues that could cause it, but 
> at least you’ve narrowed it down.
> 
> On Sep 9, 2015, at 11:05 PM, Roman Tkachenko  wrote:
> 
> > Hey guys,
> > 
> > We've been having issues in the past couple of days with CPU usage
> / load average suddenly skyrocketing on some nodes of the cluster, 
> affecting performance significantly, so the majority of requests start 
> timing out. It can go on for several hours, with CPU spiking through
> the roof then coming back down to normal and so on. Weirdly, it 
> affects only a subset of nodes and it's always the same ones. The 
> boxes Cassandra is running on are pretty beefy, 24 cores, and these 
> CPU spikes go up to >1000%.
> > 
> > What is the best way to debug this kind of issue and find out 
> what Cassandra is doing during spikes like this? Doesn't seem to be 
> compaction related as sometimes during these spikes "nodetool 
> compactionstats" says no compactions are running.
> > 
> > Thanks!
> > 
> 






Re: High CPU usage on some of nodes

2015-09-10 Thread Robert Coli
On Thu, Sep 10, 2015 at 10:54 AM, Roman Tkachenko 
wrote:
>
> [5 second CMS GC] Is my best shot to play with JVM settings trying to tune
> garbage collection then?
>

Yep. As a minor note, if the machines are that beefy, they probably have a
lot of RAM, you might wish to consider trying G1 GC and a larger heap.
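
If you try that, a starting point might look something like this in
cassandra-env.sh (sizes are guesses, not a recommendation; test before rolling
out):

  MAX_HEAP_SIZE="16G"
  JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
  JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
  # and remove the CMS-specific flags (-XX:+UseConcMarkSweepGC, -XX:+UseParNewGC, ...)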

=Rob


Re: Should replica placement change after a topology change?

2015-09-10 Thread Robert Coli
On Thu, Sep 10, 2015 at 8:55 AM, Richard Dawe 
wrote:

> So if you have a topology that would change if you switched from
> SimpleStrategy to NetworkTopologyStrategy plus multiple racks, it sounds
> like a different migration strategy would be needed?
>
> I am imagining:
>
>1. Switch to a different snitch, and the keyspace from SimpleStrategy
>to NTS but keep it all in one rack. So effectively the same topology, but
>with a different snitch.
>2. Set up a new data centre with the desired topology.
>3. Change the keyspace to have replicas in the new DC.
>4. Rebuild all the nodes in the new DC.
>5. Flip all your clients over to the new DC.
>6. Decommission your original DC.
>
>
That would work, yes. I would add:

- 4.5. Repair all nodes.

But really, avoid getting in this situation in the first place... :D

=Rob


Re: High CPU usage on some of nodes

2015-09-10 Thread Graham Sanderson
Haven’t been following this thread, but we run beefy machines with 8gig new 
gen, 12 gig old gen (down from 16g since moving memtables off heap, we can 
probably go lower)…

Apart from making sure you have all the latest -XX: flags from cassandra-env.sh 
(and MALLOC_ARENA_MAX), I personally would recommend running latest 2.1.x with

memory_allocator: JEMallocAllocator
memtable_allocation_type: offheap_objects

Some people will probably disagree, but it works great for us (rare long pauses 
sub 2 secs), and if you’re seeing slow GC because of promotion failure of 
objects 131074 dwords big, then I definitely suggest you give it a try.

> On Sep 10, 2015, at 1:43 PM, Robert Coli  wrote:
> 
> On Thu, Sep 10, 2015 at 10:54 AM, Roman Tkachenko  > wrote: 
> [5 second CMS GC] Is my best shot to play with JVM settings trying to tune 
> garbage collection then?
> 
> Yep. As a minor note, if the machines are that beefy, they probably have a 
> lot of RAM, you might wish to consider trying G1 GC and a larger heap.
> 
> =Rob
> 
>  





Re: confusion about nodetool cfstats

2015-09-10 Thread Chris Lohfink
All metrics reported in cfstats are for just the one node (they are pulled from
JMX). To see cluster aggregates, it's best to use a monitoring tool like
OpsCenter, Graphite, InfluxDB, Nagios, etc. It's a good idea to have
something like this set up for many reasons anyway.
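
In the meantime, a crude way to get a cluster-wide number is to sum the
per-node values yourself, e.g. (hypothetical host names, assuming ssh access;
newer nodetool versions also accept a keyspace or keyspace.table argument to
cfstats):

  for h in cass1 cass2 cass3; do
    echo -n "$h  "
    ssh "$h" nodetool cfstats my_keyspace | grep 'Space used (live)'
  done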

If you are using DSE you can use the performance service to get some of the
metrics (including aggregates across dc, keyspace, cluster etc) from CQL.

Chris Lohfink

On Thu, Sep 10, 2015 at 9:38 PM, Shuo Chen  wrote:

> Sorry to send the previous message.
>
> I want to monitor columnfamily space used with nodetool cfstats. The
> document says,
> Space used (live), bytes: 9592399 (space that is measured depends on the
> operating system)
>
> Does this metric show space used on one node or on the whole cluster?
>
> If it is just one node, is there a method to retrieve load info on the
> whole cluster?
>
> 
> Shuo Chen
>
>
> On Fri, Sep 11, 2015 at 10:36 AM, Shuo Chen  wrote:
>
>> Hi!
>>
>> I want to monitor columnfamily space used with nodetool cfstats. The
>> document says,
>> Space used (live), bytes: 9592399 (space that is measured depends on the
>> operating system)
>>
>
>


confusion about nodetool cfstats

2015-09-10 Thread Shuo Chen
Hi!

I want to monitor columnfamily space used with nodetool cfstats. The
document says,
Space used (live), bytes: 9592399 (space that is measured depends on the
operating system)


Re: confusion about nodetool cfstats

2015-09-10 Thread Shuo Chen
Sorry to send the previous message.

I want to monitor columnfamily space used with nodetool cfstats. The
document says,
Space used (live), bytes: 9592399 (space that is measured depends on the
operating system)

Does this metric show space used on one node or on the whole cluster?

If it is just one node, is there a method to retrieve load info on the
whole cluster?


Shuo Chen


On Fri, Sep 11, 2015 at 10:36 AM, Shuo Chen  wrote:

> Hi!
>
> I want to monitor columnfamily space used with nodetool cfstats. The
> document says,
> Space used (live), bytes: 9592399 (space that is measured depends on the
> operating system)
>


Re: Should replica placement change after a topology change?

2015-09-10 Thread Robert Coli
On Thu, Sep 10, 2015 at 12:33 PM, Nate McCall 
wrote:

> I can confirm that the above process works (definitely include Rob's
>> repair suggestion, though). It is really the only way we've found to safely
>> go from SimpleSnitch to rack-aware NTS.
>>
>
> The same process works/is required for SimpleSnitch to Ec2Snitch fwiw.
>

I have safely gone from SimpleSnitch/Strategy to NTS/Ec2Snitch by doing a
NOOP in terms of replica placement, a few times.

This was before vnodes... I feel like vnodes may be a meaningful
impediment, they certainly make checking all ranges before and after much
more involved...
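
For what it's worth, nodetool describering dumps every token range with its
endpoints, which at least makes the before/after comparison mechanical
(keyspace name is a placeholder):

  nodetool describering my_keyspace > placement_before.txt
  # ... change snitch/strategy, repair ...
  nodetool describering my_keyspace > placement_after.txt
  diff placement_before.txt placement_after.txt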

=Rob


Re: Should replica placement change after a topology change?

2015-09-10 Thread Nate McCall
>
>
> So if you have a topology that would change if you switched from
>> SimpleStrategy to NetworkTopologyStrategy plus multiple racks, it sounds
>> like a different migration strategy would be needed?
>>
>> I am imagining:
>>
>>1. Switch to a different snitch, and the keyspace from SimpleStrategy
>>to NTS but keep it all in one rack. So effectively the same topology, but
>>with a different snitch.
>>2. Set up a new data centre with the desired topology.
>>3. Change the keyspace to have replicas in the new DC.
>>4. Rebuild all the nodes in the new DC.
>>5. Flip all your clients over to the new DC.
>>6. Decommission your original DC.
>>
>>
>
> That would work, yes. I would add:
>
> - 4.5. Repair all nodes.
>

I can confirm that the above process works (definitely include Rob's repair
suggestion, though). It is really the only way we've found to safely go
from SimpleSnitch to rack-aware NTS.

The same process works/is required for SimpleSnitch to Ec2Snitch fwiw.




-- 
-
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Is it normal to see a node version handshake with itself?

2015-09-10 Thread Eric Plowe
I noticed in the system.log of one of my nodes

INFO  [HANDSHAKE-mia1-cas-001.bongojuice.com/172.16.245.1] 2015-09-10
16:00:37,748 OutboundTcpConnection.java:485 - Handshaking version with
mia1-cas-001.bongojuice.com/172.16.245.1

The machine I am on is mia1-cas-001.

If it's nothing, never mind, just stood out to me.

~Eric