[ https://issues.apache.org/jira/browse/CASSANDRA-13196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15952407#comment-15952407 ]
Aleksandr Sorokoumov commented on CASSANDRA-13196:
--------------------------------------------------

Hi Jeff,

Thank you for the feedback and the reference to CASSANDRA-11748; I agree that a blind retry might not be the best approach here. I checked the available logs once again, compared them against the logs of the passing tests, and found the following details. Here are the relevant parts, including 2 of the 3 failing migration tasks (node2_debug.log):

{noformat}
DEBUG [GossipStage:1] 2017-02-06 22:13:17,494 Gossiper.java:1524 - Received a regular ack from /127.0.0.3, can now exit shadow round
DEBUG [RequestResponseStage-1] 2017-02-06 22:13:17,544 Gossiper.java:1013 - removing expire time for endpoint : /127.0.0.3
INFO [RequestResponseStage-1] 2017-02-06 22:13:17,544 Gossiper.java:1014 - InetAddress /127.0.0.3 is now UP
...
DEBUG [MessagingService-Outgoing-/127.0.0.3-Small] 2017-02-06 22:13:19,540 OutboundTcpConnection.java:389 - Attempting to connect to /127.0.0.3
INFO [HANDSHAKE-/127.0.0.3] 2017-02-06 22:13:19,542 OutboundTcpConnection.java:502 - Handshaking version with /127.0.0.3
DEBUG [MessagingService-Outgoing-/127.0.0.3-Small] 2017-02-06 22:13:19,547 OutboundTcpConnection.java:474 - Done connecting to /127.0.0.3
INFO [ScheduledTasks:1] 2017-02-06 22:13:20,175 TokenMetadata.java:498 - Updating topology for all endpoints that have changed
DEBUG [GossipStage:1] 2017-02-06 22:13:20,582 FailureDetector.java:457 - Ignoring interval time of 3091069037 for /127.0.0.3
INFO [GossipStage:1] 2017-02-06 22:13:20,583 Gossiper.java:1050 - Node /127.0.0.3 is now part of the cluster
DEBUG [GossipStage:1] 2017-02-06 22:13:20,585 StorageService.java:2183 - Node /127.0.0.3 state NORMAL, token [-1004975124483382337, -1043780946254161012, -1171220283294501102, -1633901128898670626, -2841540313718334770, -3101553897858755861, -3621837948233406461, -3885385423954957338, -4732725481887392373, -6179581950298736502, -6981606078877819247, -7346004435634569033, -8212720784299798818, -8540770738419743277, -9144476684623172343, -9177835532129132670, 2008413755392462902, 2811132152336883532, 3686445661174337927, 3782543114949044043, 4077664166716898608, 4223225346596811804, 4829686036948015467, 5983186107673554538, 6452742755557526311, 6973900262646615849, 7254341221965284892, 8057149706767269849, 8062244717118882837, 8473826416444317568, 8957680003056564562, 9045604815824506670]
...
DEBUG [GossipStage:1] 2017-02-06 22:13:20,622 MigrationManager.java:85 - Submitting migration task for /127.0.0.3
INFO [GossipStage:1] 2017-02-06 22:13:20,623 TokenMetadata.java:479 - Updating topology for /127.0.0.3
DEBUG [MigrationStage:1] 2017-02-06 22:13:20,623 MigrationTask.java:74 - Can't send schema pull request: node /127.0.0.3 is down.
INFO [GossipStage:1] 2017-02-06 22:13:20,628 TokenMetadata.java:479 - Updating topology for /127.0.0.3
DEBUG [GossipStage:1] 2017-02-06 22:13:20,631 MigrationManager.java:85 - Submitting migration task for /127.0.0.3
DEBUG [MigrationStage:1] 2017-02-06 22:13:20,631 MigrationTask.java:74 - Can't send schema pull request: node /127.0.0.3 is down.
...
DEBUG [RequestResponseStage-1] 2017-02-06 22:13:20,665 Gossiper.java:1013 - removing expire time for endpoint : /127.0.0.3
INFO [RequestResponseStage-1] 2017-02-06 22:13:20,665 Gossiper.java:1014 - InetAddress /127.0.0.3 is now UP
{noformat}

Following the reconnect ("Attempting to connect to /127.0.0.3"), the migration tasks were submitted after the endpoint was considered NORMAL by the StorageService, but before it was marked alive by Gossiper.realMarkAlive. Could there be a rare race condition based on the order of these actions? Could you please tell me whether this makes sense, or provide any suggestions on what to look at?
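If that ordering is indeed the cause, the suspected interleaving can be reduced to a toy simulation. This is only a sketch of the hypothesis, not Cassandra code; the class and field names here (peerAlive, runMigrationTask, MigrationRaceSketch) are hypothetical stand-ins for the FailureDetector liveness check, MigrationTask.run, and the MigrationStage executor:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

public class MigrationRaceSketch {
    // Stand-in for the peer's liveness flag, flipped by Gossiper.realMarkAlive
    // on the RequestResponseStage in the real system.
    static final AtomicBoolean peerAlive = new AtomicBoolean(false);
    static volatile String taskOutcome = "not-run";

    // Stand-in for MigrationTask.run(): bails out if the peer looks down,
    // matching the "Can't send schema pull request: node ... is down" log line.
    static void runMigrationTask() {
        if (!peerAlive.get())
            taskOutcome = "skipped: node is down";
        else
            taskOutcome = "schema pull sent";
    }

    public static void main(String[] args) throws Exception {
        ExecutorService migrationStage = Executors.newSingleThreadExecutor();

        // 1. GossipStage sees the peer's state as NORMAL and submits the
        //    migration task immediately.
        Future<?> task = migrationStage.submit(MigrationRaceSketch::runMigrationTask);
        task.get();

        // 2. Only afterwards does the response handler mark the peer UP,
        //    so the task has already run and given up.
        peerAlive.set(true);

        System.out.println(taskOutcome);
        migrationStage.shutdown();
    }
}
```

Run as written, the task is skipped even though the peer is marked UP milliseconds later, which would leave the schema unpulled exactly as in the failing test.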
Thanks,
Aleksandr

> test failure in snitch_test.TestGossipingPropertyFileSnitch.test_prefer_local_reconnect_on_listen_address
> ---------------------------------------------------------------------------------------------------------
>
>          Key: CASSANDRA-13196
>          URL: https://issues.apache.org/jira/browse/CASSANDRA-13196
>      Project: Cassandra
>   Issue Type: Bug
>     Reporter: Michael Shuler
>     Assignee: Aleksandr Sorokoumov
>       Labels: dtest, test-failure
>  Attachments: node1_debug.log, node1_gc.log, node1.log, node2_debug.log, node2_gc.log, node2.log
>
> example failure:
> http://cassci.datastax.com/job/trunk_dtest/1487/testReport/snitch_test/TestGossipingPropertyFileSnitch/test_prefer_local_reconnect_on_listen_address
> {code}
> {novnode}
> Error Message
> Error from server: code=2200 [Invalid query] message="keyspace keyspace1 does not exist"
> -------------------- >> begin captured logging << --------------------
> dtest: DEBUG: cluster ccm directory: /tmp/dtest-k6b0iF
> dtest: DEBUG: Done setting configuration options:
> { 'initial_token': None,
>   'num_tokens': '32',
>   'phi_convict_threshold': 5,
>   'range_request_timeout_in_ms': 10000,
>   'read_request_timeout_in_ms': 10000,
>   'request_timeout_in_ms': 10000,
>   'truncate_request_timeout_in_ms': 10000,
>   'write_request_timeout_in_ms': 10000}
> cassandra.policies: INFO: Using datacenter 'dc1' for DCAwareRoundRobinPolicy (via host '127.0.0.1'); if incorrect, please specify a local_dc to the constructor, or limit contact points to local cluster nodes
> cassandra.cluster: INFO: New Cassandra host <Host: 127.0.0.1 dc1> discovered
> --------------------- >> end captured logging << ---------------------
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
>     testMethod()
>   File "/home/automaton/cassandra-dtest/snitch_test.py", line 87, in test_prefer_local_reconnect_on_listen_address
>     new_rows = list(session.execute("SELECT * FROM {}".format(stress_table)))
>   File "/home/automaton/src/cassandra-driver/cassandra/cluster.py", line 1998, in execute
>     return self.execute_async(query, parameters, trace, custom_payload, timeout, execution_profile, paging_state).result()
>   File "/home/automaton/src/cassandra-driver/cassandra/cluster.py", line 3784, in result
>     raise self._final_exception
> 'Error from server: code=2200 [Invalid query] message="keyspace keyspace1 does not exist"\n-------------------- >> begin captured logging << --------------------\ndtest: DEBUG: cluster ccm directory: /tmp/dtest-k6b0iF\ndtest: DEBUG: Done setting configuration options:\n{ \'initial_token\': None,\n \'num_tokens\': \'32\',\n \'phi_convict_threshold\': 5,\n \'range_request_timeout_in_ms\': 10000,\n \'read_request_timeout_in_ms\': 10000,\n \'request_timeout_in_ms\': 10000,\n \'truncate_request_timeout_in_ms\': 10000,\n \'write_request_timeout_in_ms\': 10000}\ncassandra.policies: INFO: Using datacenter \'dc1\' for DCAwareRoundRobinPolicy (via host \'127.0.0.1\'); if incorrect, please specify a local_dc to the constructor, or limit contact points to local cluster nodes\ncassandra.cluster: INFO: New Cassandra host <Host: 127.0.0.1 dc1> discovered\n--------------------- >> end captured logging << ---------------------'
> {novnode}
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)