[ https://issues.apache.org/jira/browse/CASSANDRA-13196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15952462#comment-15952462 ]

Jeff Jirsa commented on CASSANDRA-13196:
----------------------------------------

Wouldn't be surprised if there was a race condition there - a not-dissimilar race 
was solved recently in CASSANDRA-12653, where the race was in setting up token 
metadata as the node came out of shadow round, and this looks fairly similar. We 
come out of shadow round at {{2017-02-06 22:13:17,494}}, we submit the migration 
tasks at {{2017-02-06 22:13:20,622}} and immediately ({{2017-02-06 22:13:20,623}}) 
decide not to send them, and the FD only sees the nodes come up at 
{{2017-02-06 22:13:20,665}}. At the very least, I'm not sure why we'd even try to 
submit the migration task knowing the instance was down - requeueing the schema 
pull immediately on failure here definitely wouldn't have helped (we'd have failed 
to send it again, since the instance was still down).

I'm not sure of the history here, but it seems like 
[MigrationManager#shouldPullSchemaFrom|https://github.com/Gerrrr/cassandra/blob/463f3fecd9348ea0a4ce6eeeb30141527b8b10eb/src/java/org/apache/cassandra/schema/MigrationManager.java#L125]
 could potentially check the endpoint's UP/DOWN state in addition to its messaging 
version.
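
For illustration, here's a minimal sketch of what that extra check might look like - 
assuming {{FailureDetector.instance.isAlive()}} is the right liveness check for this 
path, and paraphrasing (not quoting) the existing messaging-version condition from 
the linked method:

{code}
import java.net.InetAddress;

import org.apache.cassandra.gms.FailureDetector;
import org.apache.cassandra.net.MessagingService;

// Sketch only, not an actual patch - relevant fragment of MigrationManager.
public class MigrationManagerSketch
{
    private static boolean shouldPullSchemaFrom(InetAddress endpoint)
    {
        // Existing intent: only pull schema from endpoints whose messaging version
        // we know and which are on our current version (paraphrased, not copied).
        return MessagingService.instance().knowsVersion(endpoint)
               && MessagingService.instance().getRawVersion(endpoint) == MessagingService.current_version
               // Proposed addition: skip endpoints the failure detector still sees as
               // DOWN, so we don't submit a migration task we already know we can't deliver.
               && FailureDetector.instance.isAlive(endpoint);
    }
}
{code}

That wouldn't close the race by itself, but it would at least avoid queueing a 
migration task against an instance we already know is down.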


> test failure in 
> snitch_test.TestGossipingPropertyFileSnitch.test_prefer_local_reconnect_on_listen_address
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13196
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13196
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Michael Shuler
>            Assignee: Aleksandr Sorokoumov
>              Labels: dtest, test-failure
>         Attachments: node1_debug.log, node1_gc.log, node1.log, 
> node2_debug.log, node2_gc.log, node2.log
>
>
> example failure:
> http://cassci.datastax.com/job/trunk_dtest/1487/testReport/snitch_test/TestGossipingPropertyFileSnitch/test_prefer_local_reconnect_on_listen_address
> {code}
> {novnode}
> Error Message
> Error from server: code=2200 [Invalid query] message="keyspace keyspace1 does 
> not exist"
> -------------------- >> begin captured logging << --------------------
> dtest: DEBUG: cluster ccm directory: /tmp/dtest-k6b0iF
> dtest: DEBUG: Done setting configuration options:
> {   'initial_token': None,
>     'num_tokens': '32',
>     'phi_convict_threshold': 5,
>     'range_request_timeout_in_ms': 10000,
>     'read_request_timeout_in_ms': 10000,
>     'request_timeout_in_ms': 10000,
>     'truncate_request_timeout_in_ms': 10000,
>     'write_request_timeout_in_ms': 10000}
> cassandra.policies: INFO: Using datacenter 'dc1' for DCAwareRoundRobinPolicy 
> (via host '127.0.0.1'); if incorrect, please specify a local_dc to the 
> constructor, or limit contact points to local cluster nodes
> cassandra.cluster: INFO: New Cassandra host <Host: 127.0.0.1 dc1> discovered
> --------------------- >> end captured logging << ---------------------
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
>     testMethod()
>   File "/home/automaton/cassandra-dtest/snitch_test.py", line 87, in 
> test_prefer_local_reconnect_on_listen_address
>     new_rows = list(session.execute("SELECT * FROM {}".format(stress_table)))
>   File "/home/automaton/src/cassandra-driver/cassandra/cluster.py", line 
> 1998, in execute
>     return self.execute_async(query, parameters, trace, custom_payload, 
> timeout, execution_profile, paging_state).result()
>   File "/home/automaton/src/cassandra-driver/cassandra/cluster.py", line 
> 3784, in result
>     raise self._final_exception
> 'Error from server: code=2200 [Invalid query] message="keyspace keyspace1 
> does not exist"\n-------------------- >> begin captured logging << 
> --------------------\ndtest: DEBUG: cluster ccm directory: 
> /tmp/dtest-k6b0iF\ndtest: DEBUG: Done setting configuration options:\n{   
> \'initial_token\': None,\n    \'num_tokens\': \'32\',\n    
> \'phi_convict_threshold\': 5,\n    \'range_request_timeout_in_ms\': 10000,\n  
>   \'read_request_timeout_in_ms\': 10000,\n    \'request_timeout_in_ms\': 
> 10000,\n    \'truncate_request_timeout_in_ms\': 10000,\n    
> \'write_request_timeout_in_ms\': 10000}\ncassandra.policies: INFO: Using 
> datacenter \'dc1\' for DCAwareRoundRobinPolicy (via host \'127.0.0.1\'); if 
> incorrect, please specify a local_dc to the constructor, or limit contact 
> points to local cluster nodes\ncassandra.cluster: INFO: New Cassandra host 
> <Host: 127.0.0.1 dc1> discovered\n--------------------- >> end captured 
> logging << ---------------------'
> {novnode}
> {code}


