[ 
https://issues.apache.org/jira/browse/CASSANDRA-13196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksandr Sorokoumov updated CASSANDRA-13196:
---------------------------------------------
    Reviewer: Alex Petrov
      Status: Patch Available  (was: Open)

The test failed with "keyspace keyspace1 does not exist" because none of the 
migration tasks completed during the pre-bootstrap schema migration, so the 
node was bootstrapped with its schema out of sync.
{{MigrationManager.waitUntilReadyForBootstrap}} (which is invoked by 
{{StorageService.waitForSchema}}) only waits for the in-flight tasks to finish 
and discards any that take longer than {{MIGRATION_TASK_WAIT_IN_SECONDS}} to 
complete.
Schema migration tasks are scheduled when an endpoint's state changes 
significantly: it joins the cluster, becomes alive, or its schema version 
changes.
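
For illustration, the wait-and-discard behaviour described above can be 
modelled roughly as follows. This is a simplified sketch, not the actual 
Cassandra source: the class {{WaitAndDiscardSketch}}, the method 
{{waitForInflight}} and the millisecond timeout parameter are hypothetical 
names, and each in-flight task is assumed to be represented by a 
{{CountDownLatch}}.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical model of "wait for in-flight tasks, discard the slow ones".
class WaitAndDiscardSketch {

    /**
     * Waits up to timeoutMillis for each in-flight task and returns the
     * number of tasks that were given up on. Nothing retries a task that
     * timed out, so the caller may proceed without any schema at all.
     */
    static int waitForInflight(List<CountDownLatch> inflight, long timeoutMillis) {
        int discarded = 0;
        for (CountDownLatch task : inflight) {
            try {
                if (!task.await(timeoutMillis, TimeUnit.MILLISECONDS))
                    discarded++; // timed out: the task is simply dropped
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for schema", e);
            }
        }
        return discarded;
    }

    public static void main(String[] args) {
        List<CountDownLatch> inflight = new ArrayList<>();
        CountDownLatch finished = new CountDownLatch(1);
        finished.countDown();                  // a task that completed
        inflight.add(finished);
        inflight.add(new CountDownLatch(1));   // a task that never completes
        System.out.println("discarded: " + waitForInflight(inflight, 50));
    }
}
```

If every task in the list times out, the method still returns normally, 
which corresponds to the node bootstrapping with its schema out of sync.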

The idea is that it is safe to restart a migration task that has timed out: 
either the task will succeed on one of the next retries, or it will 
eventually be killed by {{FailureDetector}} once the endpoint is marked as 
unreachable.
AFAIU there will be at least one migration task per endpoint. With the retry 
mechanism, {{MigrationManager.waitUntilReadyForBootstrap}} will run until 
the migration tasks for all reachable nodes succeed.
This means that either we will receive the migration data from at least one of 
the nodes, or all the nodes will be unreachable, in which case the bootstrap 
is supposed to fail anyway.
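
A minimal sketch of that retry argument, under stated assumptions: this is 
not the code from the patch. {{MigrationRetrySketch}}, {{pullUntilDone}} and 
both predicates are hypothetical; {{attemptPull}} stands in for one schema 
pull that either completes in time or times out, and {{isReachable}} stands 
in for the failure detector's view of the endpoint.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical model of "retry timed-out pulls until success or unreachable".
class MigrationRetrySketch {

    /**
     * Retries the pull for each endpoint until it succeeds or the endpoint
     * is reported unreachable. Returns the endpoints whose pull succeeded.
     */
    static Set<String> pullUntilDone(List<String> endpoints,
                                     Predicate<String> attemptPull,   // true = completed in time
                                     Predicate<String> isReachable) { // stand-in for the failure detector
        Deque<String> pending = new ArrayDeque<>(endpoints);
        Set<String> succeeded = new HashSet<>();
        while (!pending.isEmpty()) {
            String endpoint = pending.poll();
            if (!isReachable.test(endpoint))
                continue;                  // dropped: the endpoint is considered down
            if (attemptPull.test(endpoint))
                succeeded.add(endpoint);   // schema received from this endpoint
            else
                pending.add(endpoint);     // timed out: schedule another attempt
        }
        return succeeded;
    }

    public static void main(String[] args) {
        // "n2" replies, "n3" is unreachable and is therefore never retried.
        Set<String> ok = pullUntilDone(Arrays.asList("n2", "n3"),
                                       e -> e.equals("n2"),
                                       e -> !e.equals("n3"));
        System.out.println(ok);
    }
}
```

The loop ends only when every endpoint has either succeeded or been dropped 
as unreachable. An endpoint that stays reachable but never replies keeps the 
loop running indefinitely.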

*Steps to reproduce*

To test the retry, I commented out sending the reply in 
{{org.apache.cassandra.schema.SchemaPullVerbHandler.doVerb}} and ran the 
original 
{{snitch_test.TestGossipingPropertyFileSnitch.test_prefer_local_reconnect_on_listen_address}}
 test.
_NB:_ the test will run forever because, without a response, the migration 
requests time out and are then restarted.

*Code*
https://github.com/Gerrrr/cassandra/tree/13196-3.11

*CI builds*:

* https://cassci.datastax.com/job/ifesdjeen-13196-trunk-dtest/
* https://cassci.datastax.com/job/ifesdjeen-13196-trunk-testall/

> test failure in 
> snitch_test.TestGossipingPropertyFileSnitch.test_prefer_local_reconnect_on_listen_address
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13196
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13196
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Michael Shuler
>            Assignee: Aleksandr Sorokoumov
>              Labels: dtest, test-failure
>         Attachments: node1_debug.log, node1_gc.log, node1.log, 
> node2_debug.log, node2_gc.log, node2.log
>
>
> example failure:
> http://cassci.datastax.com/job/trunk_dtest/1487/testReport/snitch_test/TestGossipingPropertyFileSnitch/test_prefer_local_reconnect_on_listen_address
> {code}
> {novnode}
> Error Message
> Error from server: code=2200 [Invalid query] message="keyspace keyspace1 does 
> not exist"
> -------------------- >> begin captured logging << --------------------
> dtest: DEBUG: cluster ccm directory: /tmp/dtest-k6b0iF
> dtest: DEBUG: Done setting configuration options:
> {   'initial_token': None,
>     'num_tokens': '32',
>     'phi_convict_threshold': 5,
>     'range_request_timeout_in_ms': 10000,
>     'read_request_timeout_in_ms': 10000,
>     'request_timeout_in_ms': 10000,
>     'truncate_request_timeout_in_ms': 10000,
>     'write_request_timeout_in_ms': 10000}
> cassandra.policies: INFO: Using datacenter 'dc1' for DCAwareRoundRobinPolicy 
> (via host '127.0.0.1'); if incorrect, please specify a local_dc to the 
> constructor, or limit contact points to local cluster nodes
> cassandra.cluster: INFO: New Cassandra host <Host: 127.0.0.1 dc1> discovered
> --------------------- >> end captured logging << ---------------------
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
>     testMethod()
>   File "/home/automaton/cassandra-dtest/snitch_test.py", line 87, in 
> test_prefer_local_reconnect_on_listen_address
>     new_rows = list(session.execute("SELECT * FROM {}".format(stress_table)))
>   File "/home/automaton/src/cassandra-driver/cassandra/cluster.py", line 
> 1998, in execute
>     return self.execute_async(query, parameters, trace, custom_payload, 
> timeout, execution_profile, paging_state).result()
>   File "/home/automaton/src/cassandra-driver/cassandra/cluster.py", line 
> 3784, in result
>     raise self._final_exception
> 'Error from server: code=2200 [Invalid query] message="keyspace keyspace1 
> does not exist"\n-------------------- >> begin captured logging << 
> --------------------\ndtest: DEBUG: cluster ccm directory: 
> /tmp/dtest-k6b0iF\ndtest: DEBUG: Done setting configuration options:\n{   
> \'initial_token\': None,\n    \'num_tokens\': \'32\',\n    
> \'phi_convict_threshold\': 5,\n    \'range_request_timeout_in_ms\': 10000,\n  
>   \'read_request_timeout_in_ms\': 10000,\n    \'request_timeout_in_ms\': 
> 10000,\n    \'truncate_request_timeout_in_ms\': 10000,\n    
> \'write_request_timeout_in_ms\': 10000}\ncassandra.policies: INFO: Using 
> datacenter \'dc1\' for DCAwareRoundRobinPolicy (via host \'127.0.0.1\'); if 
> incorrect, please specify a local_dc to the constructor, or limit contact 
> points to local cluster nodes\ncassandra.cluster: INFO: New Cassandra host 
> <Host: 127.0.0.1 dc1> discovered\n--------------------- >> end captured 
> logging << ---------------------'
> {novnode}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
