[jira] [Commented] (CASSANDRA-12965) StreamReceiveTask causing high CPU utilization during repair

2017-07-17 Thread Jeff Jirsa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16091147#comment-16091147
 ] 

Jeff Jirsa commented on CASSANDRA-12965:


Relating this to CASSANDRA-11303, the "rethink inbound streaming throughput 
throttle" ticket, which would let us better tune this sort of behavior.


> StreamReceiveTask causing high CPU utilization during repair
> 
>
> Key: CASSANDRA-12965
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12965
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Randy Fradin
>
> During a full repair run, I observed one node in my cluster using 100% CPU 
> (100% of all cores on a 48-core machine). When I took a stack trace I found 
> exactly 48 running StreamReceiveTask threads. Each was in the same block of 
> code in StreamReceiveTask.OnCompletionRunnable:
> {noformat}
> "StreamReceiveTask:8077" #1511134 daemon prio=5 os_prio=0 
> tid=0x7f01520a8800 nid=0x6e77 runnable [0x7f020dfae000]
>java.lang.Thread.State: RUNNABLE
> at java.util.ComparableTimSort.binarySort(ComparableTimSort.java:258)
> at java.util.ComparableTimSort.sort(ComparableTimSort.java:203)
> at java.util.Arrays.sort(Arrays.java:1312)
> at java.util.Arrays.sort(Arrays.java:1506)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:141)
> at 
> org.apache.cassandra.utils.IntervalTree$IntervalNode.(IntervalTree.java:257)
> at 
> org.apache.cassandra.utils.IntervalTree$IntervalNode.(IntervalTree.java:280)
> at 
> org.apache.cassandra.utils.IntervalTree.(IntervalTree.java:72)
> at 
> org.apache.cassandra.db.DataTracker$SSTableIntervalTree.(DataTracker.java:590)
> at 
> org.apache.cassandra.db.DataTracker$SSTableIntervalTree.(DataTracker.java:584)
> at 
> org.apache.cassandra.db.DataTracker.buildIntervalTree(DataTracker.java:565)
> at 
> org.apache.cassandra.db.DataTracker$View.replace(DataTracker.java:761)
> at 
> org.apache.cassandra.db.DataTracker.addSSTablesToTracker(DataTracker.java:428)
> at 
> org.apache.cassandra.db.DataTracker.addSSTables(DataTracker.java:283)
> at 
> org.apache.cassandra.db.ColumnFamilyStore.addSSTables(ColumnFamilyStore.java:1422)
> at 
> org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:148)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> All 48 threads were in ColumnFamilyStore.addSSTables(), and specifically in 
> the IntervalNode constructor called from the IntervalTree constructor.
> It stayed this way for maybe an hour before we restarted the node. The repair 
> was also generating thousands (20,000+) of tiny SSTables in a table that 
> previously had just 20.
> I don't know enough about SSTables and ColumnFamilyStore to know whether all this 
> CPU work is necessary or a bug, but I did notice that these tasks run on 
> a thread pool constructed in StreamReceiveTask.java, so perhaps this pool 
> should have a maximum thread count lower than the number of processors on the 
> machine, at least for machines with many processors. Any reason not to do 
> that? Any ideas for a reasonable number or formula to cap the thread count?
> Some additional info: We have never run incremental repair on this cluster, 
> so that is not a factor. All our tables use LCS. Unfortunately I don't have 
> the log files from the period saved.
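
For illustration, the cap suggested above could be expressed along these lines. This is a minimal sketch, not Cassandra's actual StreamReceiveTask pool construction, and MAX_COMPLETION_THREADS is a hypothetical value.

{noformat}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CappedCompletionPool
{
    // Hypothetical ceiling; not a value taken from Cassandra.
    private static final int MAX_COMPLETION_THREADS = 8;

    // Builds a completion pool that never exceeds the cap, even on very wide machines.
    public static ExecutorService create()
    {
        int cores = Runtime.getRuntime().availableProcessors();
        int threads = Math.min(cores, MAX_COMPLETION_THREADS);
        return Executors.newFixedThreadPool(threads);
    }
}
{noformat}

On the 48-core box described above, such a cap would bound the sort and interval-tree rebuild work to 8 concurrent tasks instead of 48, at the cost of queued stream sessions finishing more slowly.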





[jira] [Commented] (CASSANDRA-12965) StreamReceiveTask causing high CPU utilization during repair

2017-07-17 Thread Randy Fradin (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090973#comment-16090973
 ] 

Randy Fradin commented on CASSANDRA-12965:
--

We set -Dcassandra.available_processors=(some number less than the # of cores 
on the host) as suggested by liangsibin. It limits the size of several thread 
pools, including this one. Not exactly a fix, but at least it prevents Cassandra 
from monopolizing all of the CPU resources.
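
As a rough illustration of how a property like this can feed into pool sizing (a standalone sketch, not Cassandra's actual FBUtilities or StreamReceiveTask code):

{noformat}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AvailableProcessorsExample
{
    public static void main(String[] args)
    {
        // Fall back to the real core count when the property is not set.
        int available = Integer.getInteger("cassandra.available_processors",
                                           Runtime.getRuntime().availableProcessors());
        // Size a worker pool from the (possibly overridden) processor count.
        ExecutorService pool = Executors.newFixedThreadPool(available);
        System.out.println("Pool sized to " + available + " threads");
        pool.shutdown();
    }
}
{noformat}

Running the JVM with, for example, -Dcassandra.available_processors=16 would make the sketch size its pool to 16 threads regardless of the hardware core count.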






[jira] [Commented] (CASSANDRA-12965) StreamReceiveTask causing high CPU utilization during repair

2017-07-17 Thread Guillaume Drot (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089687#comment-16089687
 ] 

Guillaume Drot commented on CASSANDRA-12965:


[~rbfblk] Did you find a solution to your problem? We have had the same problem 
since we upgraded to DSE OpsCenter 6.

[~pauloricardomg] Did you have time to investigate the issue?

Thanks.






[jira] [Commented] (CASSANDRA-12965) StreamReceiveTask causing high CPU utilization during repair

2017-02-28 Thread liangsibin (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887642#comment-15887642
 ] 

liangsibin commented on CASSANDRA-12965:


Maybe we can add -Dcassandra.available_processors=20 at Cassandra startup to 
lower the number of StreamReceiveTask threads.





[jira] [Commented] (CASSANDRA-12965) StreamReceiveTask causing high CPU utilization during repair

2017-02-27 Thread liangsibin (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885877#comment-15885877
 ] 

liangsibin commented on CASSANDRA-12965:


When I use four nodes to bulk load data into Cassandra, sometimes one of the nodes' 
CPU usage is almost 100%, which makes the bulk load very slow. Is there any way to 
solve this problem? Any help would be appreciated.





[jira] [Commented] (CASSANDRA-12965) StreamReceiveTask causing high CPU utilization during repair

2016-12-20 Thread Randy Fradin (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765287#comment-15765287
 ] 

Randy Fradin commented on CASSANDRA-12965:
--

Understood on not fixing this in 2.1 - it will still be nice to see it fixed for 
when we upgrade. Here's the info you asked for:

- This happened more than once. We had a data center's worth of nodes down for 
a long period of time (longer than the hinted handoff window) before this 
happened so I am assuming that caused more ranges to be out of sync than usual 
before this repair run. The tables were not particularly big (a few GB total at 
most), so it could not have been a large volume of data that needed to be synced, 
but nevertheless it resulted in thousands of SSTables being created on the 
nodes that had been down for a set of tables that normally have ~20ish 
SSTables. After killing repair, running it again would yield the same result. 
We avoided running repair on those particular tables until we could figure out 
what to do. The large number of SSTables caused its own problems that we worked 
around, but separate from that we had this CPU problem resulting from all the 
streaming sessions that created the SSTables.
- We run full (non-incremental) repair with the -pr and -par options. Each run 
is always for a specific table.
- We have around 400 tables in this cluster with varying RFs, but the RF for 
the tables that were causing the issue is 3 per data center across 4 data 
centers. There are 24 nodes total in the cluster and each node has 256 vnodes.
- Yes, we have our own repair coordinator that's currently configured to run up 
to 8 repairs at the same time across the cluster.


[jira] [Commented] (CASSANDRA-12965) StreamReceiveTask causing high CPU utilization during repair

2016-12-20 Thread Paulo Motta (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765216#comment-15765216
 ] 

Paulo Motta commented on CASSANDRA-12965:
-

I'm afraid this will no longer be fixed on 2.1 given it's in critical-fixes-only 
mode, but I'd like to understand more about the problem to see if it is still 
present in later versions, since this is pretty similar to CASSANDRA-13055. A few 
questions:
- Was this a one-off problem or did it happen more than once?
- What repair command/options do you use?
- How many tables, RF and vnodes?
- Is repair triggered simultaneously in more than one node?



