[ 
https://issues.apache.org/jira/browse/CASSANDRA-12192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419578#comment-15419578
 ] 

Tyler Hobbs commented on CASSANDRA-12192:
-----------------------------------------

Here's basically what's happening:

The 3.x node tries to add an index while the cluster is half-upgraded.  This 
triggers the symptoms from CASSANDRA-12236 (a deserialization error on the 
3.0.x node).  The 3.0.x node closes its end of the connection, but the 3.x 
node won't notice this until it next tries to write on that connection.  The 
truncate is the next thing that happens.  All nodes truncate successfully, 
but when the 3.x node tries to write its response, it gets a broken pipe 
error.  The truncate then times out on the coordinator.

I've opened a [dtest PR|https://github.com/riptano/cassandra-dtest/pull/1229] 
to remove the unnecessary schema change from the test.  However, there is one 
improvement we could make in C* to make these sorts of errors (which especially 
tend to happen during upgrades) less problematic.  We already have a mechanism 
for retrying messages once after reconnecting.  This is currently reserved for 
non-droppable verbs (like schema change messages, repair messages, etc).  It 
seems like retrying _all_ verbs once after reconnecting would be reasonable.  
This approach would also fix the test.  I'm not sure why droppable messages 
were omitted from this behavior in CASSANDRA-5393 (where the retry mechanism 
was created), so there might be some potential issues I'm unaware of.
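
To make the proposal concrete, here is a rough Python sketch of the two 
policies.  It is only a toy model, not the actual C* messaging code: 
FakeConnection, Message, send(), the verb names, and the DROPPABLE_VERBS set 
are all hypothetical, and it assumes the truncate response counts as a 
droppable message (which is what the failure above implies).

```python
from dataclasses import dataclass

# Hypothetical verb set for illustration only; not Cassandra's real list.
# Assumption: the truncate response is treated as a droppable message.
DROPPABLE_VERBS = {"READ", "MUTATION", "TRUNCATE_RESPONSE"}


@dataclass
class Message:
    verb: str
    retried: bool = False

    @property
    def droppable(self) -> bool:
        return self.verb in DROPPABLE_VERBS


class FakeConnection:
    """Stand-in for an outbound connection whose peer may have closed its end."""

    def __init__(self):
        self.peer_closed = False
        self.delivered = []
        self.reconnects = 0

    def write(self, message):
        # The writer only learns about the closed peer when it tries to write.
        if self.peer_closed:
            raise BrokenPipeError("peer closed the connection")
        self.delivered.append(message.verb)

    def reconnect(self):
        self.peer_closed = False
        self.reconnects += 1


def send(conn, message, retry_all_verbs):
    """Write a message; on a broken pipe, reconnect and retry at most once.

    With retry_all_verbs=False only non-droppable verbs are retried (the
    current behavior as described above); with True, every verb is retried.
    """
    try:
        conn.write(message)
        return True
    except BrokenPipeError:
        conn.reconnect()
        may_retry = retry_all_verbs or not message.droppable
        if may_retry and not message.retried:
            message.retried = True
            return send(conn, message, retry_all_verbs)
        return False
```

In this model, sending a "TRUNCATE_RESPONSE" on a connection whose peer has 
closed is lost when retry_all_verbs is False, which corresponds to the 
coordinator timing out, and is delivered after one reconnect when it is True.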

Can anybody offer a reason why we shouldn't retry _all_ verbs after 
reconnecting?

> dtest failure in 
> upgrade_tests.cql_tests.TestCQLNodes3RF3_Upgrade_current_3_0_x_To_head_trunk.map_keys_indexing_test
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-12192
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12192
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sean McCarthy
>            Assignee: Tyler Hobbs
>              Labels: dtest
>         Attachments: node1.log, node1_debug.log, node1_gc.log, node2.log, 
> node2_debug.log, node2_gc.log, node3.log, node3_debug.log, node3_gc.log
>
>
> example failure:
> http://cassci.datastax.com/job/upgrade_tests-all/59/testReport/upgrade_tests.cql_tests/TestCQLNodes3RF3_Upgrade_current_3_0_x_To_head_trunk/map_keys_indexing_test
> Failed on CassCI build upgrade_tests-all #59
> {code}
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
>     testMethod()
>   File "/home/automaton/cassandra-dtest/tools.py", line 290, in wrapped
>     f(obj)
>   File "/home/automaton/cassandra-dtest/upgrade_tests/cql_tests.py", line 
> 3668, in map_keys_indexing_test
>     cursor.execute("TRUNCATE test")
>   File "cassandra/cluster.py", line 1941, in 
> cassandra.cluster.Session.execute (cassandra/cluster.c:33642)
>     return self.execute_async(query, parameters, trace, custom_payload, 
> timeout, execution_profile).result()
>   File "cassandra/cluster.py", line 3629, in 
> cassandra.cluster.ResponseFuture.result (cassandra/cluster.c:69369)
>     raise self._final_exception
> '<Error from server: code=1003 [Error during truncate] message="Error during 
> truncate: Truncate timed out - received only 2 responses">
> {code}
> Related failure: 
> http://cassci.datastax.com/job/upgrade_tests-all/59/testReport/upgrade_tests.cql_tests/TestCQLNodes2RF1_Upgrade_current_3_0_x_To_head_trunk/map_keys_indexing_test/



