[ https://issues.apache.org/jira/browse/CASSANDRA-8343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150311#comment-15150311 ]

Sylvain Lebresne commented on CASSANDRA-8343:
---------------------------------------------

I'm not super familiar with how the streaming protocol number is used. If there 
is some protocol version negotiation between nodes that makes it possible to 
bump the number without breaking backward compatibility, then that would be 
fine for trunk (well, assuming we do carefully test this). Otherwise, we'd have 
to wait for 4.0.

bq. we've had this problem since forever \[...\] there is the workaround of 
increasing {{streaming_socket_timeout}}

I agree that this probably means it's not worth making overly risky changes for 
this before trunk. But really, it feels to me that the main problem is how the 
code handles this kind of problem. Assuming we properly surface the timeout on 
the sending side, there is no reason not to properly close the session and move 
on on the receiving side when this happens. We could still log an error or 
warning on that receiving side explaining what happened, and that if the 
sending side timed out, the user may want to increase 
{{streaming_socket_timeout}}. We can also document in the yaml that 
{{streaming_socket_timeout}} should be high enough to let 2ndary indexes/MVs be 
built.
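
To make the receiving-side idea concrete, here is a minimal sketch, not the 
actual streaming code. {{ReceivingSession}}, {{CompletionReply}} and the 
{{Runnable}} standing in for the index build are hypothetical names made up for 
illustration; only {{closeIfFinished}}, the {{CountDownLatch}} and 
{{streaming_socket_timeout}} come from the ticket. The point is just that the 
session gets closed (and the latch decremented) even when the final write fails.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

public class ReceiverCloseSketch {
    private static final Logger logger = Logger.getLogger("ReceiverCloseSketch");

    /** Hypothetical stand-in for the socket write that reports completion to the sender. */
    interface CompletionReply {
        void send() throws IOException;
    }

    /** Hypothetical stand-in for a receiving stream session. */
    static final class ReceivingSession {
        private final CountDownLatch done;
        private boolean closed = false;

        ReceivingSession(CountDownLatch done) {
            this.done = done;
        }

        void closeIfFinished(Runnable buildIndexes, CompletionReply reply) {
            try {
                buildIndexes.run(); // may take longer than the sender's socket timeout
                reply.send();       // fails if the sender has already closed the socket
            } catch (IOException e) {
                // Explain what happened instead of swallowing the exception.
                logger.warning("Could not notify the sender that the stream completed ("
                        + e.getMessage() + "). If the sender timed out, consider "
                        + "increasing streaming_socket_timeout.");
            } finally {
                close(); // always release whoever is waiting on the latch
            }
        }

        synchronized void close() {
            if (!closed) {
                closed = true;
                done.countDown();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        ReceivingSession session = new ReceivingSession(latch);

        // Simulate a sender that closed its socket while indexes were being built.
        session.closeIfFinished(() -> { }, () -> {
            throw new IOException("Broken pipe");
        });

        boolean completed = latch.await(1, TimeUnit.SECONDS);
        System.out.println("completed = " + completed);
    }
}
```

With the {{finally}} block in place, the waiter on the latch is released 
whether or not the reply could be written, so the move/bootstrap would not hang 
on this path. Whether the session should then be reported as failed or 
successful is a separate question this sketch does not try to answer.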

Imo, if we handle the case better (by not breaking anything but logging enough 
info that the user understands what happened and that this is really not a big 
deal), it's fine if we only fix it properly in 4.0 (we do need to have a better 
solution eventually of course).


> Secondary index creation causes moves/bootstraps to fail
> --------------------------------------------------------
>
>                 Key: CASSANDRA-8343
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8343
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Michael Frisch
>            Assignee: Paulo Motta
>
> Node moves/bootstraps are failing if the stream timeout is set to a value in 
> which secondary index creation cannot complete.  This happens because at the 
> end of the very last stream the StreamInSession.closeIfFinished() function 
> calls maybeBuildSecondaryIndexes on every column family.  If the stream time 
> + all CF's index creation takes longer than your stream timeout then the 
> socket closes from the sender's side, the receiver of the stream tries to 
> write to said socket because it's not null, an IOException is thrown but not 
> caught in closeIfFinished(), the exception is caught somewhere and not 
> logged, AbstractStreamSession.close() is never called, and the CountDownLatch 
> is never decremented.  This causes the move/bootstrap to continue forever 
> until the node is restarted.
> This problem of stream time + secondary index creation time exists on 
> decommissioning/unbootstrap as well but since it's on the sending side the 
> timeout triggers the onFailure() callback which does decrement the 
> CountDownLatch leading to completion.
> A cursory glance at the 2.0 code leads me to believe this problem would exist 
> there as well.
> Temporary workaround: set a really high/infinite stream timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
