[jira] [Commented] (CASSANDRA-12965) StreamReceiveTask causing high CPU utilization during repair
[ https://issues.apache.org/jira/browse/CASSANDRA-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16091147#comment-16091147 ]

Jeff Jirsa commented on CASSANDRA-12965:
----------------------------------------

Relating to CASSANDRA-11303, a "rethink inbound streaming throughput throttle" ticket that would let us better tune this sort of behavior.

> StreamReceiveTask causing high CPU utilization during repair
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-12965
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12965
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Randy Fradin
>
> During a full repair run, I observed one node in my cluster using 100% cpu
> (100% of all cores on a 48-core machine). When I took a stack trace I found
> exactly 48 running StreamReceiveTask threads. Each was in the same block of
> code in StreamReceiveTask.OnCompletionRunnable:
> {noformat}
> "StreamReceiveTask:8077" #1511134 daemon prio=5 os_prio=0 tid=0x7f01520a8800 nid=0x6e77 runnable [0x7f020dfae000]
>    java.lang.Thread.State: RUNNABLE
>         at java.util.ComparableTimSort.binarySort(ComparableTimSort.java:258)
>         at java.util.ComparableTimSort.sort(ComparableTimSort.java:203)
>         at java.util.Arrays.sort(Arrays.java:1312)
>         at java.util.Arrays.sort(Arrays.java:1506)
>         at java.util.ArrayList.sort(ArrayList.java:1454)
>         at java.util.Collections.sort(Collections.java:141)
>         at org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:257)
>         at org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
>         at org.apache.cassandra.utils.IntervalTree.<init>(IntervalTree.java:72)
>         at org.apache.cassandra.db.DataTracker$SSTableIntervalTree.<init>(DataTracker.java:590)
>         at org.apache.cassandra.db.DataTracker$SSTableIntervalTree.<init>(DataTracker.java:584)
>         at org.apache.cassandra.db.DataTracker.buildIntervalTree(DataTracker.java:565)
>         at org.apache.cassandra.db.DataTracker$View.replace(DataTracker.java:761)
>         at org.apache.cassandra.db.DataTracker.addSSTablesToTracker(DataTracker.java:428)
>         at org.apache.cassandra.db.DataTracker.addSSTables(DataTracker.java:283)
>         at org.apache.cassandra.db.ColumnFamilyStore.addSSTables(ColumnFamilyStore.java:1422)
>         at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:148)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> All 48 threads were in ColumnFamilyStore.addSSTables(), and specifically in
> the IntervalNode constructor called from the IntervalTree constructor.
> It stayed this way for maybe an hour before we restarted the node. The repair
> was also generating thousands (20,000+) of tiny SSTables in a table that
> previously had just 20.
> I don't know enough about SSTables and ColumnFamilyStore to know if all this
> CPU work is necessary or a bug, but I did notice that these tasks are run on
> a thread pool constructed in StreamReceiveTask.java, so perhaps this pool
> should have a thread count max less than the number of processors on the
> machine, at least for machines with a lot of processors. Any reason not to do
> that? Any ideas for a reasonable # or formula to cap the thread count?
> Some additional info: We have never run incremental repair on this cluster,
> so that is not a factor. All our tables use LCS. Unfortunately I don't have
> the log files from the period saved.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
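The reporter's suggestion above is to cap the stream-receive completion pool below the machine's processor count. A minimal sketch of such a cap (hypothetical: `BoundedReceivePool` and the divisor of 4 are illustrative, not Cassandra's actual StreamReceiveTask code):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not Cassandra's actual code: bound the pool that runs
// stream-completion work to a fraction of the cores, so 48 concurrent
// OnCompletionRunnable-style tasks cannot saturate every core on the host.
public class BoundedReceivePool {
    static ThreadPoolExecutor create(int coreDivisor) {
        int cores = Runtime.getRuntime().availableProcessors();
        // e.g. divisor 4 on a 48-core box yields 12 threads; never below 1
        int threads = Math.max(1, cores / coreDivisor);
        return new ThreadPoolExecutor(threads, threads,
                60, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = create(4);
        System.out.println("max threads: " + pool.getMaximumPoolSize());
        pool.shutdown();
    }
}
```

Any fixed divisor is a trade-off: too small and a wide stream session backs up behind few threads, too large and the original problem returns on many-core hosts.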
[jira] [Commented] (CASSANDRA-12965) StreamReceiveTask causing high CPU utilization during repair
[ https://issues.apache.org/jira/browse/CASSANDRA-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090973#comment-16090973 ]

Randy Fradin commented on CASSANDRA-12965:
------------------------------------------

We set -Dcassandra.available_processors=(some number less than the # of cores on the host) as suggested by liangsibin. It limits the size of several thread pools, including this one. Not exactly a fix, but it at least prevents Cassandra from monopolizing all of the CPU resources.
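The `cassandra.available_processors` property mentioned in this workaround is a JVM system property read at startup; assuming a standard `conf/cassandra-env.sh` (the exact file and value here are illustrative), it could be set like this:

```shell
# Make Cassandra size its internal thread pools as if the host had 16
# cores rather than all 48 (example value; tune for your hardware).
JVM_OPTS="$JVM_OPTS -Dcassandra.available_processors=16"
```

Note this caps several pools at once, not only StreamReceiveTask's, so it trades some parallelism elsewhere for CPU headroom.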
[jira] [Commented] (CASSANDRA-12965) StreamReceiveTask causing high CPU utilization during repair
[ https://issues.apache.org/jira/browse/CASSANDRA-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089687#comment-16089687 ]

Guillaume Drot commented on CASSANDRA-12965:
--------------------------------------------

[~rbfblk] Did you find a solution to your problem? We have had the same problem since we upgraded to DSE OpsCenter 6. [~pauloricardomg] Did you have time to investigate the issue? Thanks.
[jira] [Commented] (CASSANDRA-12965) StreamReceiveTask causing high CPU utilization during repair
[ https://issues.apache.org/jira/browse/CASSANDRA-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887642#comment-15887642 ]

liangsibin commented on CASSANDRA-12965:
----------------------------------------

Maybe we can add -Dcassandra.available_processors=20 at Cassandra startup to lower the number of StreamReceiveTask threads.
[jira] [Commented] (CASSANDRA-12965) StreamReceiveTask causing high CPU utilization during repair
[ https://issues.apache.org/jira/browse/CASSANDRA-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885877#comment-15885877 ]

liangsibin commented on CASSANDRA-12965:
----------------------------------------

When I use four nodes to bulk load data into Cassandra, sometimes one node's CPU usage is almost 100%, which makes the bulk load very slow. Is there any method that can solve this problem? Any help would be appreciated.
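The stack trace in this ticket shows each completed stream rebuilding the full SSTable interval tree (DataTracker.buildIntervalTree) from scratch, which may explain why both repair and bulk load hit this: when tens of thousands of tiny SSTables arrive one by one, those repeated rebuilds compound quadratically. A toy illustration of that pattern (a plain sorted list stands in for the interval tree; `RebuildCost` is hypothetical, not Cassandra code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: models how rebuilding a sorted structure from scratch
// after every SSTable add multiplies total work as the table count grows.
public class RebuildCost {
    // Total elements re-visited when n items arrive one at a time and the
    // whole collection is re-sorted after each arrival: 1 + 2 + ... + n.
    static long totalElementsSorted(int n) {
        List<Integer> tables = new ArrayList<>();
        long touched = 0;
        for (int i = 0; i < n; i++) {
            tables.add(n - i);          // a newly streamed "SSTable"
            Collections.sort(tables);   // full rebuild, like a new IntervalTree
            touched += tables.size();
        }
        return touched;
    }

    public static void main(String[] args) {
        System.out.println(totalElementsSorted(20));     // 210 (normal table count)
        System.out.println(totalElementsSorted(20_000)); // 200010000 (the ticket's worst case)
    }
}
```

Going from ~20 SSTables to 20,000+ thus increases the rebuild work by roughly a factor of a million, not a thousand, before even counting the per-rebuild sort cost.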
[jira] [Commented] (CASSANDRA-12965) StreamReceiveTask causing high CPU utilization during repair
[ https://issues.apache.org/jira/browse/CASSANDRA-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765287#comment-15765287 ]

Randy Fradin commented on CASSANDRA-12965:
------------------------------------------

Understood on not fixing this in 2.1; it will still be nice to see it fixed for when we upgrade. Here's the info you asked for:

- This happened more than once. We had a data center's worth of nodes down for a long period of time (longer than the hinted handoff window) before this happened, so I am assuming that caused more ranges to be out of sync than usual before this repair run. The tables were not particularly big (a few GB total at most), so it could not have been a large volume of data that needed to be synced, but it nevertheless resulted in thousands of SSTables being created on the nodes that had been down, for a set of tables that normally have ~20 SSTables. After killing repair, running it again would yield the same result. We avoided running repair on those particular tables until we could figure out what to do. The large number of SSTables caused its own problems that we worked around, but separate from that we had this CPU problem resulting from all the streaming sessions that created the SSTables.
- We run full (non-incremental) repair with the -pr and -par options. Each run is always for a specific table.
- We have around 400 tables in this cluster with varying RFs, but the RF for the tables that were causing the issue is 3 per data center across 4 data centers. There are 24 nodes total in the cluster and each node has 256 vnodes.
- Yes, we have our own repair coordinator that's currently configured to run up to 8 repairs at the same time across the cluster.
[jira] [Commented] (CASSANDRA-12965) StreamReceiveTask causing high CPU utilization during repair
[ https://issues.apache.org/jira/browse/CASSANDRA-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765216#comment-15765216 ]

Paulo Motta commented on CASSANDRA-12965:
-----------------------------------------

I'm afraid this will no longer be fixed on 2.1 given it's in critical-fixes-only mode, but I'd like to understand more about the problem to see if it is still present in later versions, since this is pretty similar to CASSANDRA-13055. A few questions:

- Was this a one-off problem or did it happen more than once?
- What repair command/options do you use?
- How many tables, RFs and vnodes?
- Is repair triggered simultaneously on more than one node?