[
https://issues.apache.org/jira/browse/CASSANDRA-15666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081858#comment-17081858
]
ZhaoYang edited comment on CASSANDRA-15666 at 4/13/20, 5:28 AM:
----------------------------------------------------------------
|[patch|https://github.com/apache/cassandra/pull/497]|[dtest|https://github.com/apache/cassandra-dtest/pull/63]|
Previous changes:
* Synchronization on "StreamSession#maybeComplete()" to avoid race condition
on streaming completion.
* Only the "follower" is allowed to send the CompleteMessage.
* Only the "initiator" is allowed to close the session and its channels after
receiving the CompleteMessage.
New changes to fix dtest failures:
* NettyStreamingMessageSender:
** don't close channels in NettyStreamingMessageSender, as they will be closed
by initator on "closeSession()"
** handle fileTransferExecutor pool shutdown gracefully in case of
ClosedByInterruptException
** only include inbound handler for initiator's control channel
* ChannelProxy:
** Use new "ChannelProxy" instance instead of shared copy in stream writer to
prevent interruped thread closing backing channel
* StreamSession: close session if channel is closed due to EOF from
"StreamingInboundHandler"
* OutboundConnection: fix OutboundConnection.id() to use proper remote/local
address
* Preview repair:
** Move follower's "completePreview()" from "prepareAck()" to "prepareAsync()"
because initiator will close connection via "completePreview()" on
"prepareSynAck()". Without this change, preview_repair_test.py will fail
because initiator closes channels before follower moving to final state and
follower "StreamSession" will log some error upon EOF.
** In case of preview, do not send "PrepareAckMessage" to follower, as
follower has already closed stream session on "prepareAsync()"
* Dtest: include "Socket closed before session completion" into
{{ignore_log_patterns in
"{{repair_test.py::TestRepair::test_dead_sync_participant"}}}}, it's logged
when stream session was closed upon EOF due to node down.
was (Author: jasonstack):
|[patch|https://github.com/apache/cassandra/pull/497]|[dtest|https://github.com/apache/cassandra-dtest/pull/63]|
Previous changes:
* Synchronization on "StreamSession#maybeComplete()" to avoid race condition
on streaming completion.
* Only the "follower" is allowed to send the CompleteMessage.
* Only the "initiator" is allowed to close the session and its channels after
receiving the CompleteMessage.
New changes to fix dtest failures:
* NettyStreamingMessageSender:
** don't close channels in NettyStreamingMessageSender, as they will be closed
by initator on "closeSession()"
** handle fileTransferExecutor pool shutdown gracefully in case of
ClosedByInterruptException
** only include inbound handler for initiator's control channel
* ChannelProxy:
** Use new "ChannelProxy" instance instead of shared copy in stream writer to
prevent interruped thread closing backing channel
* StreamSession: close session if channel is closed due to EOF from
"StreamingInboundHandler"
* OutboundConnection: fix OutboundConnection.id() to use proper remote/local
address
* Preview repair:
** Move follower's "completePreview()" from "prepareAck()" to "prepareAsync()"
because initiator will close connection via "completePreview()" on
"prepareSynAck()". Without this change, preview_repair_test.py will fail
because initiator closes channels before follower moving to final state and
follower "StreamSession" will log some error upon EOF.
** In case of preview, do not send "PrepareAckMessage" to follower, as
follower has already closed stream session on "prepareAsync()"
* Dtest: include "Socket closed before session completion" into
{{ignore_log_patterns}}, it's logged when stream session was closed upon EOF
due to node down.
> Race condition when completing stream sessions
> ----------------------------------------------
>
> Key: CASSANDRA-15666
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15666
> Project: Cassandra
> Issue Type: Bug
> Components: Legacy/Streaming and Messaging
> Reporter: Sergio Bossa
> Assignee: ZhaoYang
> Priority: Normal
> Labels: pull-request-available
> Fix For: 4.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> {{StreamSession#prepareAsync()}} executes, as the name implies,
> asynchronously from the IO thread: this opens up for race conditions between
> the sending of the {{PrepareSynAckMessage}} and the call to
> {{StreamSession#maybeCompleted()}}. I.e., the following could happen:
> 1) Node A sends {{PrepareSynAckMessage}} from the {{prepareAsync()}} thread.
> 2) Node B receives it and starts streaming.
> 3) Node A receives the streamed file and sends {{ReceivedMessage}}.
> 4) At this point, if this was the only file to stream, both nodes are ready
> to close the session via {{maybeCompleted()}}, but:
> a) Node A will call it twice from both the IO thread and the thread at #1,
> closing the session and its channels.
> b) Node B will attempt to send a {{CompleteMessage}}, but will fail because
> the session has been closed in the meantime.
> There are other subtle variations of the pattern above, depending on the
> order of concurrently sent/received messages.
> I believe the best fix would be to modify the message exchange so that:
> 1) Only the "follower" is allowed to send the {{CompleteMessage}}.
> 2) Only the "initiator" is allowed to close the session and its channels
> after receiving the {{CompleteMessage}}.
> By doing so, the message exchange logic would be easier to reason about,
> which is overall a win anyway.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]