[
https://issues.apache.org/jira/browse/CASSANDRA-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Erik Onnen updated CASSANDRA-1766:
----------------------------------
Attachment: CASSANDRA-1766.patch
I'm not sure this is exactly related, but I encountered an issue where a stream
failed after anti-entropy (AE) repair and was left wedged with the following
stack trace:
"STREAM-STAGE:1" prio=10 tid=0x00007ff2440a5800 nid=0x3c3c in Object.wait() [0x00007ff24a21f000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00007ff28884fad8> (a org.apache.cassandra.utils.SimpleCondition)
    at java.lang.Object.wait(Object.java:485)
    at org.apache.cassandra.utils.SimpleCondition.await(SimpleCondition.java:38)
    - locked <0x00007ff28884fad8> (a org.apache.cassandra.utils.SimpleCondition)
    at org.apache.cassandra.streaming.StreamOutManager.waitForStreamCompletion(StreamOutManager.java:164)
    at org.apache.cassandra.streaming.StreamOut.transferSSTables(StreamOut.java:138)
    at org.apache.cassandra.service.AntiEntropyService$Differencer$1.runMayThrow(AntiEntropyService.java:511)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
We suspect this occurred because the destination node was in the drain state,
although from reading the code it appears that any failed stream where the
destination goes away would be susceptible to this issue. In that case, the
StreamOutManager never unblocks, which makes subsequent repairs to any node
that had a pending transfer impossible.
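To illustrate why the thread above can wait forever, here is a minimal sketch of
a one-shot condition with a bounded wait. It is not Cassandra's SimpleCondition
and not what the attached patch does; the class name BoundedCondition and the
method awaitMillis are hypothetical, chosen only for this example. The point is
that a waiter that is never signalled gets control back after the timeout
instead of staying parked.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a one-shot condition; NOT Cassandra's SimpleCondition.
class BoundedCondition {
    private final CountDownLatch latch = new CountDownLatch(1);

    // Called by whichever side observes that the stream has completed.
    void signal() {
        latch.countDown();
    }

    // Returns true if signalled, false if the timeout elapsed or the
    // waiting thread was interrupted before a signal arrived.
    boolean awaitMillis(long timeoutMillis) {
        try {
            return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Preserve the interrupt for callers further up the stack.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A caller that sees false can then decide whether to retry, log, or drop the
pending transfer, which an unbounded wait never gets the chance to do.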
I've attached a patch that smooths out some possible streaming issues:
* Catches streaming errors. As near as I can tell, if an error occurred during
streaming because the remote node went away, it would bubble all the way out of
the executor and not even be logged. Worse, it would keep the current pending
file wedged and never allow it to be cleared. This patch removes the failed
transfer when an IOException occurs. It could perhaps be more general.
* Allows manual purging of pending files to a host via JMX, which means
un-sticking a wedged transfer no longer requires a restart of that node.
Unfortunately, it also removes the file, which could require anti-compaction
again, but this was the least painful path through the code.
* Corrects an unlikely but potentially fatal scenario where concurrent
mutation of and reads from the file and fileMap references could result in
dirty reads, by making them concurrency-safe collections. The only way I could
see this happening is if someone were to run repair multiple times in
succession while streaming was happening. It is unlikely but possible, and the
effects of unsafe map reads can include a completely unresponsive JVM.
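The three points above can be sketched roughly as follows. This is an
illustration under stated assumptions, not the contents of the attached patch:
the class PendingTransfers, its Sender interface, and all method names are
hypothetical, and the JMX wiring is omitted (purge is simply the kind of
operation one might expose through an MBean).

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical bookkeeping for pending outbound files, keyed by destination host.
class PendingTransfers {
    // Hypothetical callback that performs the actual network transfer.
    interface Sender {
        void send(String file) throws IOException;
    }

    // Concurrency-safe collections, so concurrent repairs cannot corrupt the map.
    private final ConcurrentMap<String, List<String>> pendingByHost =
            new ConcurrentHashMap<>();

    void add(String host, String file) {
        pendingByHost.computeIfAbsent(host, h -> new CopyOnWriteArrayList<>()).add(file);
    }

    // Attempts one transfer. The file leaves the pending list whether the
    // send succeeds or fails, so a dead peer cannot wedge the queue.
    boolean transfer(String host, String file, Sender sender) {
        try {
            sender.send(file);
            return true;
        } catch (IOException e) {
            // Log instead of letting the error vanish inside the executor.
            System.err.println("Streaming " + file + " to " + host + " failed: " + e);
            return false;
        } finally {
            List<String> files = pendingByHost.get(host);
            if (files != null) {
                files.remove(file);
            }
        }
    }

    // Manual escape hatch: drops everything pending for a host and
    // returns how many files were discarded.
    int purge(String host) {
        List<String> removed = pendingByHost.remove(host);
        return removed == null ? 0 : removed.size();
    }

    int pendingCount(String host) {
        List<String> files = pendingByHost.get(host);
        return files == null ? 0 : files.size();
    }
}
```

With this shape, a failed send to an unreachable host returns false and clears
that file, and purge can clear whatever else is still queued for the host
without restarting the node.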
I'm not entirely sure this is the right thing to do, but I thought I'd float it
out there for review. Whatever the correct fix is, I think there needs to be a
way to cancel pending streams so that they aren't stuck.
> Streaming never makes progress
> ------------------------------
>
> Key: CASSANDRA-1766
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1766
> Project: Cassandra
> Issue Type: Bug
> Affects Versions: 0.6.7
> Reporter: Brandon Williams
> Fix For: 0.6.9
>
> Attachments: CASSANDRA-1766.patch
>
>
> I have a client that can never complete a bootstrap. AC finishes, streaming
> begins. Stream initiate completes, and the sources wait on the transfer to
> finish, but progress is never made on any stream. Nodetool reports streaming
> is happening, the socket is held open, but nothing happens.