[
https://issues.apache.org/jira/browse/CASSANDRA-8343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150311#comment-15150311
]
Sylvain Lebresne commented on CASSANDRA-8343:
---------------------------------------------
I'm not very familiar with how the streaming protocol number is used. If there
is some protocol version negotiation between nodes that makes it possible to
bump the number without breaking backward compatibility, then that would be
fine for trunk (assuming we test this carefully). Otherwise, we'd have
to wait for 4.0.
bq. we've had this problem since forever [...] there is the workaround of
increasing {{streaming_socket_timeout}}
I agree that this probably means it's not worth making overly risky changes for
this before trunk. But really, it feels to me that the main problem is how the
code handles this kind of failure. Assuming we properly surface the timeout on
the sending side, there is no reason not to properly close the session and move
on on the receiving side when this happens. We could still log an error or
warning on the receiving side explaining what happened, and noting that if the
sender timed out, the user may want to increase {{streaming_socket_timeout}}.
We can also document in the yaml that {{streaming_socket_timeout}} should be
high enough to let secondary indexes/MVs be built.
Imo, if we handle the case better (by not breaking anything, but logging enough
info that the user understands what happened and that this is really not a big
deal), it's fine if we only fix it properly in 4.0 (we do need a better
solution eventually, of course).
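To make the receiving-side handling suggested above concrete, here is a minimal Java sketch. It is not Cassandra's actual streaming code: {{StreamCloseSketch}}, {{ReceivingSession}}, {{CompletionWriter}} and the log message are hypothetical names invented for illustration, and the only assumption taken from the ticket is that a failed write to the sender's socket throws an IOException and that a CountDownLatch gates the move/bootstrap. The idea is simply that the session is closed, and the latch counted down, whether or not the final write succeeds.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

public class StreamCloseSketch {
    static final Logger logger = Logger.getLogger("StreamCloseSketch");

    /** Hypothetical stand-in for the final write to the sender's socket. */
    interface CompletionWriter {
        void writeCompletion() throws IOException;
    }

    /** Hypothetical receiving-side session; not Cassandra's StreamInSession. */
    static final class ReceivingSession {
        private final CountDownLatch latch;
        private boolean closed = false;

        ReceivingSession(CountDownLatch latch) {
            this.latch = latch;
        }

        /** Attempts the final write, then always closes the session. */
        void closeIfFinished(CompletionWriter writer) {
            try {
                writer.writeCompletion();
            } catch (IOException e) {
                // The sender has probably timed out and closed its socket:
                // log enough for the user to understand, then move on.
                logger.warning("Could not notify sender (" + e.getMessage()
                        + "); if the sender timed out, consider increasing"
                        + " streaming_socket_timeout");
            } finally {
                close();
            }
        }

        /** Idempotent close: the latch is counted down exactly once. */
        synchronized void close() {
            if (closed) {
                return;
            }
            closed = true;
            latch.countDown();
        }

        synchronized boolean isClosed() {
            return closed;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        ReceivingSession session = new ReceivingSession(latch);

        // Simulate the sender having already closed the socket.
        session.closeIfFinished(() -> {
            throw new IOException("Broken pipe");
        });

        // The waiter is released instead of blocking until a restart.
        boolean done = latch.await(1, TimeUnit.SECONDS);
        System.out.println("move/bootstrap unblocked: " + done); // prints "move/bootstrap unblocked: true"
    }
}
```

Putting the close in a {{finally}} block means a timeout on the sender degrades into a logged warning rather than a stuck operation, which is the "not breaking anything but logging enough info" behaviour described above.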
> Secondary index creation causes moves/bootstraps to fail
> --------------------------------------------------------
>
> Key: CASSANDRA-8343
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8343
> Project: Cassandra
> Issue Type: Bug
> Reporter: Michael Frisch
> Assignee: Paulo Motta
>
> Node moves/bootstraps fail if the stream timeout is set to a value within
> which secondary index creation cannot complete. This happens because at the
> end of the very last stream the StreamInSession.closeIfFinished() function
> calls maybeBuildSecondaryIndexes on every column family. If the stream time
> plus index creation for all CFs takes longer than the stream timeout, then the
> socket closes on the sender's side. The receiver of the stream still tries to
> write to that socket because it's not null, and an IOException is thrown but
> not caught in closeIfFinished(). The exception is caught somewhere else and
> not logged, AbstractStreamSession.close() is never called, and the
> CountDownLatch is never decremented. This causes the move/bootstrap to
> continue forever until the node is restarted.
> This problem of stream time + secondary index creation time exists on
> decommissioning/unbootstrap as well but since it's on the sending side the
> timeout triggers the onFailure() callback which does decrement the
> CountDownLatch leading to completion.
> A cursory glance at the 2.0 code leads me to believe this problem would exist
> there as well.
> Temporary workaround: set a really high/infinite stream timeout.