[
https://issues.apache.org/jira/browse/CASSANDRA-8620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367968#comment-14367968
]
Philip Thompson commented on CASSANDRA-8620:
--------------------------------------------
Sorry for the delay. Are you still seeing this issue? Have you reproduced since
the original occurrence?
> Bootstrap session hanging indefinitely
> --------------------------------------
>
> Key: CASSANDRA-8620
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8620
> Project: Cassandra
> Issue Type: Bug
> Environment: Debian 7, Oracle JDK 1.7.0_51, AWS + GCE
> Reporter: Adam Horwich
>
> Hi! We have been running a relatively small 2.1.2 cluster across 2 DCs for a
> few months, with ~100GB load per node and RF=3, and over the last few weeks
> we have been trying to scale up capacity.
> We've recently been seeing scenarios in which the Bootstrap or Unbootstrap
> streaming process hangs indefinitely for one or more sessions on the receiver,
> without any stacktrace or exception. This does not happen every time, and we
> do not get into this state with the same sender every time. When the receiver
> is in a hung state, the following can be found in the thread dump:
> The Stream-IN thread for one or more sessions is blocked in the following state:
> Thread 24942: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Compiled frame)
>  - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=2043 (Compiled frame)
>  - java.util.concurrent.ArrayBlockingQueue.take() @bci=20, line=374 (Compiled frame)
>  - org.apache.cassandra.streaming.compress.CompressedInputStream.read() @bci=31, line=89 (Compiled frame)
>  - java.io.DataInputStream.readUnsignedShort() @bci=4, line=337 (Compiled frame)
>  - org.apache.cassandra.utils.BytesReadTracker.readUnsignedShort() @bci=4, line=140 (Compiled frame)
>  - org.apache.cassandra.utils.ByteBufferUtil.readShortLength(java.io.DataInput) @bci=1, line=317 (Compiled frame)
>  - org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(java.io.DataInput) @bci=2, line=327 (Compiled frame)
>  - org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize(java.io.DataInput) @bci=5, line=397 (Compiled frame)
>  - org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize(java.io.DataInput) @bci=2, line=381 (Compiled frame)
>  - org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(java.io.DataInput, org.apache.cassandra.db.ColumnSerializer$Flag, int, org.apache.cassandra.io.sstable.Descriptor$Version) @bci=10, line=75 (Compiled frame)
>  - org.apache.cassandra.db.AbstractCell$1.computeNext() @bci=25, line=52 (Compiled frame)
>  - org.apache.cassandra.db.AbstractCell$1.computeNext() @bci=1, line=46 (Compiled frame)
>  - com.google.common.collect.AbstractIterator.tryToComputeNext() @bci=9, line=143 (Compiled frame)
>  - com.google.common.collect.AbstractIterator.hasNext() @bci=61, line=138 (Compiled frame)
>  - org.apache.cassandra.io.sstable.SSTableWriter.appendFromStream(org.apache.cassandra.db.DecoratedKey, org.apache.cassandra.config.CFMetaData, java.io.DataInput, org.apache.cassandra.io.sstable.Descriptor$Version) @bci=320, line=283 (Compiled frame)
>  - org.apache.cassandra.streaming.StreamReader.writeRow(org.apache.cassandra.io.sstable.SSTableWriter, java.io.DataInput, org.apache.cassandra.db.ColumnFamilyStore) @bci=26, line=157 (Compiled frame)
>  - org.apache.cassandra.streaming.compress.CompressedStreamReader.read(java.nio.channels.ReadableByteChannel) @bci=258, line=89 (Compiled frame)
>  - org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(java.nio.channels.ReadableByteChannel, int, org.apache.cassandra.streaming.StreamSession) @bci=69, line=48 (Interpreted frame)
>  - org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(java.nio.channels.ReadableByteChannel, int, org.apache.cassandra.streaming.StreamSession) @bci=4, line=38 (Interpreted frame)
>  - org.apache.cassandra.streaming.messages.StreamMessage.deserialize(java.nio.channels.ReadableByteChannel, int, org.apache.cassandra.streaming.StreamSession) @bci=37, line=55 (Interpreted frame)
>  - org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run() @bci=24, line=245 (Interpreted frame)
>  - java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)
> Debug logging shows that the receiver is still reading the file it is
> receiving from the sender and has not yet sent an ACK.
> The receiver is waiting for more data to finish writing its row, and the
> sender is not sending any more data. On both ends there is a large amount of
> data (~5MB) stuck in the socket buffers: the Recv-Q on the receiver and the
> Send-Q on the sender.
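>
> To make the parked frame above concrete, here is a minimal sketch (not
> Cassandra's actual CompressedInputStream; the class and field names are
> hypothetical) of the pattern the thread dump shows: an InputStream backed
> by a blocking queue of decompressed buffers. If the producer side stops
> supplying buffers, take() parks the reading thread indefinitely.
>
>     import java.io.IOException;
>     import java.io.InputStream;
>     import java.util.concurrent.ArrayBlockingQueue;
>     import java.util.concurrent.BlockingQueue;
>
>     // Hypothetical sketch: a stream that refills itself from a queue filled
>     // by a separate network/decompression thread.
>     class QueueBackedInputStream extends InputStream {
>         private final BlockingQueue<byte[]> dataBuffer = new ArrayBlockingQueue<>(16);
>         private byte[] current = new byte[0];
>         private int pos = 0;
>
>         void supply(byte[] chunk) throws InterruptedException {
>             dataBuffer.put(chunk); // called by the producer thread
>         }
>
>         @Override
>         public int read() throws IOException {
>             while (pos == current.length) {
>                 try {
>                     // Parks forever if the producer never offers another
>                     // chunk -- the BLOCKED state in the thread dump above.
>                     current = dataBuffer.take();
>                     pos = 0;
>                 } catch (InterruptedException e) {
>                     throw new IOException(e);
>                 }
>             }
>             return current[pos++] & 0xFF;
>         }
>     }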
> We've been trying to diagnose this issue internally, but it's difficult to
> create a reliably reproducible scenario. We have so far found that restarting
> all nodes in the cluster and ensuring that a cleanup has been performed helps
> mitigate the problem (though a cleanup without restart can still result in a
> hung state). However, it's unclear to me why either of these would affect the
> streaming process in the way we've observed. One theory is that the
> calculated section sizes are inaccurate.
> We tried setting a timeout on the dataBuffer read (moving take() to poll()),
> which forced a retry of the file, but the same transfer failed again until
> retries were exhausted.
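>
> For clarity, the change amounted to something like the following (a sketch
> only; the method and timeout names are hypothetical):
>
>     import java.io.IOException;
>     import java.util.concurrent.BlockingQueue;
>     import java.util.concurrent.TimeUnit;
>
>     class TimedChunkRead {
>         // Bound the wait so a stalled sender surfaces as an exception and
>         // triggers the session's retry path instead of parking forever.
>         static byte[] nextChunk(BlockingQueue<byte[]> dataBuffer, long timeoutMillis)
>                 throws IOException, InterruptedException {
>             byte[] chunk = dataBuffer.poll(timeoutMillis, TimeUnit.MILLISECONDS);
>             if (chunk == null) {
>                 throw new IOException("timed out waiting for streamed data");
>             }
>             return chunk;
>         }
>     }
>
> This converts the indefinite hang into a failed transfer, but since the
> retried transfer stalls at the same point, it only bounds the damage rather
> than fixing the underlying problem.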
> From a heap dump in the deadlocked state we confirmed the following:
> BytesReadTracker.bytesRead = 100477411
> Length of the section being read (it happens to be the last section) = 100477411
> We also wrote all the data in the buffer to a text file and observed that the
> data in the buffer ended halfway through a row.
> The stream reader checks the value of bytes read after importing each row
> (StreamReader line 97) to detect the end of the section; however, it does not
> expect the data to end halfway through a row. When it does, deserialization
> blocks inside OnDiskAtom.deserializeFromSSTable in any of the various
> deserialization calls.
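>
> In pseudo-form, the check looks roughly like this (a hypothetical sketch of
> the logic around StreamReader line 97, not the actual code; RowSource and
> its methods are invented for illustration):
>
>     import java.io.IOException;
>
>     class SectionReaderSketch {
>         interface RowSource {
>             void readRow() throws IOException; // may block mid-row, as in deserializeFromSSTable
>             long bytesRead();                  // as tracked by BytesReadTracker
>         }
>
>         // End-of-section is only detected between rows: if the advertised
>         // section length lands mid-row, readRow() blocks waiting for bytes
>         // the sender will never send, and the loop never re-checks.
>         static void readSection(RowSource source, long sectionLength) throws IOException {
>             while (source.bytesRead() < sectionLength) {
>                 source.readRow();
>             }
>         }
>     }
>
> That would explain the heap dump above: bytesRead already equals the section
> length, but the row in flight is incomplete, so the thread is parked inside
> the row deserialization rather than at the loop's boundary check.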