[
https://issues.apache.org/jira/browse/CASSANDRA-13196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aleksandr Sorokoumov updated CASSANDRA-13196:
---------------------------------------------
Reviewer: Alex Petrov
Status: Patch Available (was: Open)
The test failure ("keyspace keyspace1 does not exist") happened because all
the migration tasks failed to complete during the pre-bootstrap schema
migration, so the node was bootstrapped with its schema out of sync.
{{MigrationManager.waitUntilReadyForBootstrap}} (which is invoked by
{{StorageService.waitForSchema}}) just waits for the in-flight tasks to finish
and discards those that take longer than {{MIGRATION_TASK_WAIT_IN_SECONDS}} to
complete.
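The wait-and-discard behaviour described above can be sketched roughly as follows. This is an illustration in Python, not the Cassandra code: {{wait_for_inflight}}, {{simulate_bootstrap_wait}} and {{timeout_s}} are hypothetical names, with {{timeout_s}} standing in for {{MIGRATION_TASK_WAIT_IN_SECONDS}}.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait


def wait_for_inflight(tasks, timeout_s):
    """Wait up to timeout_s for the futures; return (finished, discarded).

    Hypothetical stand-in for the behaviour described in the comment:
    tasks that do not finish in time are dropped and never retried.
    """
    done, not_done = wait(tasks, timeout=timeout_s)
    for task in not_done:
        task.cancel()  # best effort; an already running task keeps running
    return done, not_done


def simulate_bootstrap_wait(n_peers=2, timeout_s=0.2):
    """Every peer stalls past the timeout, so no schema pull completes."""
    release = threading.Event()  # lets the stalled tasks exit after the demo

    def stalled_pull():
        release.wait(5)  # simulates a peer that has not replied yet

    with ThreadPoolExecutor(max_workers=n_peers) as pool:
        tasks = [pool.submit(stalled_pull) for _ in range(n_peers)]
        done, discarded = wait_for_inflight(tasks, timeout_s)
        release.set()
    return len(done), len(discarded)
```

In this sketch, when every task outlives the timeout the caller gets back zero finished tasks and simply carries on, which corresponds to the node bootstrapping without the schema.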
Schema migration tasks are scheduled when an endpoint's state changes
significantly: it joins the cluster, becomes alive, or its schema version
changes. The idea is that it is safe to restart a migration task that has timed
out, because the task will either succeed on one of the next retries or
eventually be killed by {{FailureDetector}} if the endpoint is marked as
unreachable.
AFAIU there will be at least one migration task per endpoint. With the retry
mechanism, {{MigrationManager.waitUntilReadyForBootstrap}} will run until the
migration tasks to all the reachable nodes succeed.
This means that either we receive the migration data from at least one of the
nodes, or all the nodes are unreachable, in which case the bootstrap is
supposed to fail anyway.
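A minimal sketch of that retry loop, again in Python and under stated assumptions rather than as the patch itself: {{pull_schema}} and {{is_reachable}} are hypothetical callbacks standing in for a migration task and for {{FailureDetector}}, and {{max_rounds}} is a guard that exists only in this sketch so that a peer which never replies does not loop forever.

```python
def wait_until_ready_for_bootstrap(endpoints, pull_schema, is_reachable,
                                   max_rounds=None):
    """Retry schema pulls until every reachable endpoint has answered.

    pull_schema(endpoint)  -> True on success, False on timeout (hypothetical).
    is_reachable(endpoint) -> False once the endpoint is marked unreachable.
    Returns the set of endpoints whose pull succeeded.
    """
    pending = set(endpoints)
    succeeded = set()
    rounds = 0
    while pending:
        if max_rounds is not None and rounds >= max_rounds:
            raise TimeoutError("still waiting on: %s" % sorted(pending))
        rounds += 1
        for endpoint in sorted(pending):  # sorted() copies, so pending may change
            if not is_reachable(endpoint):
                pending.discard(endpoint)  # unreachable: stop retrying
            elif pull_schema(endpoint):
                pending.discard(endpoint)
                succeeded.add(endpoint)
            # otherwise the pull timed out; it stays pending and is retried
    if not succeeded:
        # all endpoints were unreachable; bootstrap is expected to fail
        raise RuntimeError("no reachable endpoint supplied a schema")
    return succeeded
```

With {{max_rounds=None}} and a reachable peer that never answers, the loop does not terminate.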
*Steps to reproduce*
To test the retry, I commented out sending the reply in
{{org.apache.cassandra.schema.SchemaPullVerbHandler.doVerb}} and ran the
original
{{snitch_test.TestGossipingPropertyFileSnitch.test_prefer_local_reconnect_on_listen_address}}
test.
_NB:_ the test will run forever because, without a response, the migration
requests time out and are then restarted.
*Code*
https://github.com/Gerrrr/cassandra/tree/13196-3.11
*CI builds*
* https://cassci.datastax.com/job/ifesdjeen-13196-trunk-dtest/
* https://cassci.datastax.com/job/ifesdjeen-13196-trunk-testall/
> test failure in
> snitch_test.TestGossipingPropertyFileSnitch.test_prefer_local_reconnect_on_listen_address
> ---------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-13196
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13196
> Project: Cassandra
> Issue Type: Bug
> Reporter: Michael Shuler
> Assignee: Aleksandr Sorokoumov
> Labels: dtest, test-failure
> Attachments: node1_debug.log, node1_gc.log, node1.log,
> node2_debug.log, node2_gc.log, node2.log
>
>
> example failure:
> http://cassci.datastax.com/job/trunk_dtest/1487/testReport/snitch_test/TestGossipingPropertyFileSnitch/test_prefer_local_reconnect_on_listen_address
> {code}
> {novnode}
> Error Message
> Error from server: code=2200 [Invalid query] message="keyspace keyspace1 does
> not exist"
> -------------------- >> begin captured logging << --------------------
> dtest: DEBUG: cluster ccm directory: /tmp/dtest-k6b0iF
> dtest: DEBUG: Done setting configuration options:
> { 'initial_token': None,
> 'num_tokens': '32',
> 'phi_convict_threshold': 5,
> 'range_request_timeout_in_ms': 10000,
> 'read_request_timeout_in_ms': 10000,
> 'request_timeout_in_ms': 10000,
> 'truncate_request_timeout_in_ms': 10000,
> 'write_request_timeout_in_ms': 10000}
> cassandra.policies: INFO: Using datacenter 'dc1' for DCAwareRoundRobinPolicy
> (via host '127.0.0.1'); if incorrect, please specify a local_dc to the
> constructor, or limit contact points to local cluster nodes
> cassandra.cluster: INFO: New Cassandra host <Host: 127.0.0.1 dc1> discovered
> --------------------- >> end captured logging << ---------------------
> Stacktrace
> File "/usr/lib/python2.7/unittest/case.py", line 329, in run
> testMethod()
> File "/home/automaton/cassandra-dtest/snitch_test.py", line 87, in
> test_prefer_local_reconnect_on_listen_address
> new_rows = list(session.execute("SELECT * FROM {}".format(stress_table)))
> File "/home/automaton/src/cassandra-driver/cassandra/cluster.py", line
> 1998, in execute
> return self.execute_async(query, parameters, trace, custom_payload,
> timeout, execution_profile, paging_state).result()
> File "/home/automaton/src/cassandra-driver/cassandra/cluster.py", line
> 3784, in result
> raise self._final_exception
> 'Error from server: code=2200 [Invalid query] message="keyspace keyspace1
> does not exist"\n-------------------- >> begin captured logging <<
> --------------------\ndtest: DEBUG: cluster ccm directory:
> /tmp/dtest-k6b0iF\ndtest: DEBUG: Done setting configuration options:\n{
> \'initial_token\': None,\n \'num_tokens\': \'32\',\n
> \'phi_convict_threshold\': 5,\n \'range_request_timeout_in_ms\': 10000,\n
> \'read_request_timeout_in_ms\': 10000,\n \'request_timeout_in_ms\':
> 10000,\n \'truncate_request_timeout_in_ms\': 10000,\n
> \'write_request_timeout_in_ms\': 10000}\ncassandra.policies: INFO: Using
> datacenter \'dc1\' for DCAwareRoundRobinPolicy (via host \'127.0.0.1\'); if
> incorrect, please specify a local_dc to the constructor, or limit contact
> points to local cluster nodes\ncassandra.cluster: INFO: New Cassandra host
> <Host: 127.0.0.1 dc1> discovered\n--------------------- >> end captured
> logging << ---------------------'
> {novnode}
> {code}