[
https://issues.apache.org/jira/browse/CASSANDRA-6992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641284#comment-14641284
]
Paulo Motta commented on CASSANDRA-6992:
----------------------------------------
I tried reproducing this on my SSD laptop with a 3-node ccm cluster (10GB each),
but the CPU/IO capacity is quickly exhausted, so load/latency is affected
during the whole bootstrapping/streaming process, not only during the
preparation phase. The problem will definitely be more apparent with
large overloaded clusters and heterogeneous workloads, and even more so with
hard disks.
I haven't identified major changes in this section of the code (except for the
new streaming protocol) since 1.2, so I think we can assume the limitation is
still present on 2.1+ and move on with implementation.
I think a simple algorithm that waits until a streaming session is
established before starting the session with the next node should be sufficient
to prevent storms during bootstrap. We could also provide two tuning properties:
# number of nodes to establish bootstrap streaming sessions with in parallel
bootstrap_staggering_concurrency: 1
# wait time before establishing a session with the next node
bootstrap_staggering_interval_seconds: 60
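To make the proposal concrete, here is a rough Java sketch of the staggering loop. It is only an illustration under assumptions: the class StaggeredBootstrap, the method establishSessions and the establish callback are hypothetical names, not existing Cassandra APIs, and the callback is assumed to return a future that completes once the streaming session with that node is established. The two parameters mirror the proposed properties above.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

// Hypothetical helper, not part of Cassandra: staggers session establishment.
final class StaggeredBootstrap {
    private StaggeredBootstrap() {}

    /**
     * Starts a session with each source in order, keeping at most
     * `concurrency` sessions in the "being established" state and pausing
     * `intervalMillis` before each session after the first.
     *
     * @param establish assumed callback: starts the session and returns a
     *                  future that completes when the session is established
     */
    static <N> void establishSessions(List<N> sources,
                                      int concurrency,
                                      long intervalMillis,
                                      Function<N, CompletableFuture<Void>> establish)
            throws InterruptedException {
        if (concurrency < 1) {
            throw new IllegalArgumentException("concurrency must be >= 1");
        }
        Semaphore slots = new Semaphore(concurrency);
        boolean first = true;
        for (N source : sources) {
            // Block until fewer than `concurrency` sessions are still pending.
            slots.acquire();
            if (!first && intervalMillis > 0) {
                Thread.sleep(intervalMillis); // pause before the next node
            }
            first = false;
            try {
                // Free the slot when the session is established (or fails).
                establish.apply(source).whenComplete((ok, err) -> slots.release());
            } catch (RuntimeException e) {
                slots.release(); // callback failed synchronously
                throw e;
            }
        }
        // Wait for the remaining pending sessions before returning.
        slots.acquire(concurrency);
        slots.release(concurrency);
    }
}
```

With concurrency 1 and an interval of 60 seconds this degenerates to the naive one-at-a-time-with-a-pause behavior; failure handling and retries are deliberately left out of the sketch.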
Any thoughts? Which version should we aim for? [~jbellis]
> Bootstrap on vnodes clusters can cause stampeding/storm behavior
> ----------------------------------------------------------------
>
> Key: CASSANDRA-6992
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6992
> Project: Cassandra
> Issue Type: Improvement
> Environment: Various vnodes-enabled clusters in EC2, m1.xlarge and
> hi1.4xlarge, ~3000-8000 tokens.
> Reporter: Rick Branson
> Assignee: Paulo Motta
> Priority: Minor
>
> Assuming this is an issue with vnodes clusters because
> SSTableReader#getPositionsForRanges is more expensive to compute with 256x
> the ranges, but could be wrong. On even well-provisioned hosts, this can
> cause a severe spike in network throughput & CPU utilization from a storm of
> flushes, which impacts long-tail times pretty badly. On weaker hosts (like
> m1.xlarge with ~500GB of data), it can result in minutes of churn while the
> node gets through StreamOut#createPendingFiles. This *might* be better in
> 2.0, but it's probably still reproducible because the bootstrapping node
> sends out all of its streaming requests at once.
> I'm thinking that this could be staggered at the bootstrapping node to avoid
> the simultaneous spike across the whole cluster. Not sure on how to stagger
> it besides something very naive like one-at-a-time with a pause. Maybe this
> should also be throttled in StreamOut#createPendingFiles on the out-streaming
> host? Any thoughts?
> From the stack dump of one of our weaker nodes that was struggling for a few
> minutes just starting the StreamOut:
> "MiscStage:1" daemon prio=10 tid=0x000000000292f000 nid=0x688 runnable [0x00007f7b03df6000]
>    java.lang.Thread.State: RUNNABLE
>     at org.apache.cassandra.utils.ByteBufferUtil.readShortLength(ByteBufferUtil.java:361)
>     at org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:371)
>     at org.apache.cassandra.io.sstable.IndexHelper$IndexInfo.deserialize(IndexHelper.java:187)
>     at org.apache.cassandra.db.RowIndexEntry$Serializer.deserialize(RowIndexEntry.java:125)
>     at org.apache.cassandra.io.sstable.SSTableReader.getPosition(SSTableReader.java:889)
>     at org.apache.cassandra.io.sstable.SSTableReader.getPosition(SSTableReader.java:790)
>     at org.apache.cassandra.io.sstable.SSTableReader.getPositionsForRanges(SSTableReader.java:730)
>     at org.apache.cassandra.streaming.StreamOut.createPendingFiles(StreamOut.java:172)
>     at org.apache.cassandra.streaming.StreamOut.transferSSTables(StreamOut.java:157)
>     at org.apache.cassandra.streaming.StreamOut.transferRanges(StreamOut.java:148)
>     at org.apache.cassandra.streaming.StreamOut.transferRanges(StreamOut.java:116)
>     at org.apache.cassandra.streaming.StreamRequestVerbHandler.doVerb(StreamRequestVerbHandler.java:44)
>     at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>     at java.lang.Thread.run(Thread.java:662)