[ https://issues.apache.org/jira/browse/CASSANDRA-13196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15952407#comment-15952407 ]

Aleksandr Sorokoumov commented on CASSANDRA-13196:
--------------------------------------------------

Hi Jeff,

Thank you for the feedback and the reference to CASSANDRA-11748; I agree that a 
blind retry might not be the best approach here. I checked the available logs 
once again, compared them against the logs of the passing tests, and found the 
following details. Here are the relevant excerpts, covering 2 of the 3 failing 
migration tasks (node2_debug.log):

{noformat}
DEBUG [GossipStage:1] 2017-02-06 22:13:17,494 Gossiper.java:1524 - Received a 
regular ack from /127.0.0.3, can now exit shadow round
DEBUG [RequestResponseStage-1] 2017-02-06 22:13:17,544 Gossiper.java:1013 - 
removing expire time for endpoint : /127.0.0.3
INFO  [RequestResponseStage-1] 2017-02-06 22:13:17,544 Gossiper.java:1014 - 
InetAddress /127.0.0.3 is now UP

...

DEBUG [MessagingService-Outgoing-/127.0.0.3-Small] 2017-02-06 22:13:19,540 
OutboundTcpConnection.java:389 - Attempting to connect to /127.0.0.3
INFO  [HANDSHAKE-/127.0.0.3] 2017-02-06 22:13:19,542 
OutboundTcpConnection.java:502 - Handshaking version with /127.0.0.3
DEBUG [MessagingService-Outgoing-/127.0.0.3-Small] 2017-02-06 22:13:19,547 
OutboundTcpConnection.java:474 - Done connecting to /127.0.0.3
INFO  [ScheduledTasks:1] 2017-02-06 22:13:20,175 TokenMetadata.java:498 - 
Updating topology for all endpoints that have changed
DEBUG [GossipStage:1] 2017-02-06 22:13:20,582 FailureDetector.java:457 - 
Ignoring interval time of 3091069037 for /127.0.0.3
INFO  [GossipStage:1] 2017-02-06 22:13:20,583 Gossiper.java:1050 - Node 
/127.0.0.3 is now part of the cluster
DEBUG [GossipStage:1] 2017-02-06 22:13:20,585 StorageService.java:2183 - Node 
/127.0.0.3 state NORMAL, token [-1004975124483382337, -1043780946254161012, 
-1171220283294501102, -1633901128898670626, -2841540313718334770, 
-3101553897858755861, -3621837948233406461, -3885385423954957338, 
-4732725481887392373, -6179581950298736502, -6981606078877819247, 
-7346004435634569033, -8212720784299798818, -8540770738419743277, 
-9144476684623172343, -9177835532129132670, 2008413755392462902, 
2811132152336883532, 3686445661174337927, 3782543114949044043, 
4077664166716898608, 4223225346596811804, 4829686036948015467, 
5983186107673554538, 6452742755557526311, 6973900262646615849, 
7254341221965284892,
8057149706767269849, 8062244717118882837, 8473826416444317568, 
8957680003056564562, 9045604815824506670]

...

DEBUG [GossipStage:1] 2017-02-06 22:13:20,622 MigrationManager.java:85 - 
Submitting migration task for /127.0.0.3
INFO  [GossipStage:1] 2017-02-06 22:13:20,623 TokenMetadata.java:479 - Updating 
topology for /127.0.0.3
DEBUG [MigrationStage:1] 2017-02-06 22:13:20,623 MigrationTask.java:74 - Can't 
send schema pull request: node /127.0.0.3 is down.
INFO  [GossipStage:1] 2017-02-06 22:13:20,628 TokenMetadata.java:479 - Updating 
topology for /127.0.0.3
DEBUG [GossipStage:1] 2017-02-06 22:13:20,631 MigrationManager.java:85 - 
Submitting migration task for /127.0.0.3
DEBUG [MigrationStage:1] 2017-02-06 22:13:20,631 MigrationTask.java:74 - Can't 
send schema pull request: node /127.0.0.3 is down.
...
DEBUG [RequestResponseStage-1] 2017-02-06 22:13:20,665 Gossiper.java:1013 - 
removing expire time for endpoint : /127.0.0.3
INFO  [RequestResponseStage-1] 2017-02-06 22:13:20,665 Gossiper.java:1014 - 
InetAddress /127.0.0.3 is now UP
{noformat}

Following the reconnect ("Attempting to connect to /127.0.0.3"), the migration 
tasks were submitted after the endpoint was considered NORMAL by the 
StorageService, but before it was marked alive by Gossiper.realMarkAlive. Could 
there be a rare race condition based on the order of these actions?
Could you tell me whether this reasoning makes sense, or suggest anything else 
to look at?
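
To make the suspected interleaving concrete, here is a minimal, self-contained 
sketch of how I read it. This only models the three stages involved and is not 
the actual Cassandra code: the class name, the endpointAlive flag, and the 
40 ms delay are illustrative stand-ins (the delay mirrors the 22:13:20,631 vs 
22:13:20,665 gap in the log above). The failing check itself corresponds to 
the "Can't send schema pull request" message logged at MigrationTask.java:74:

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class MigrationRaceSketch
{
    // Stand-in for the failure detector's view of /127.0.0.3
    static final AtomicBoolean endpointAlive = new AtomicBoolean(false);

    public static void main(String[] args) throws InterruptedException
    {
        ExecutorService gossipStage = Executors.newSingleThreadExecutor();
        ExecutorService migrationStage = Executors.newSingleThreadExecutor();
        ExecutorService requestResponseStage = Executors.newSingleThreadExecutor();

        // GossipStage: the endpoint goes NORMAL and a schema pull is
        // submitted immediately, before any liveness update has landed
        gossipStage.submit(() -> migrationStage.submit(() -> {
            // The migration task bails out: realMarkAlive has not run yet
            if (!endpointAlive.get())
                System.out.println("Can't send schema pull request: node is down.");
        }));

        // RequestResponseStage: the gossip ack that triggers realMarkAlive
        // is processed on a different stage, ~40 ms later
        requestResponseStage.submit(() -> {
            try { Thread.sleep(40); } catch (InterruptedException e) { return; }
            endpointAlive.set(true);
            System.out.println("InetAddress is now UP");
        });

        for (ExecutorService stage : new ExecutorService[] { gossipStage, migrationStage, requestResponseStage })
        {
            stage.shutdown();
            stage.awaitTermination(1, TimeUnit.SECONDS);
        }
    }
}
{code}

Run as written, the sketch prints the same order as the log: the schema pull 
is rejected first, and only afterwards is the endpoint reported UP.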

Thanks,
Aleksandr

> test failure in 
> snitch_test.TestGossipingPropertyFileSnitch.test_prefer_local_reconnect_on_listen_address
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13196
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13196
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Michael Shuler
>            Assignee: Aleksandr Sorokoumov
>              Labels: dtest, test-failure
>         Attachments: node1_debug.log, node1_gc.log, node1.log, 
> node2_debug.log, node2_gc.log, node2.log
>
>
> example failure:
> http://cassci.datastax.com/job/trunk_dtest/1487/testReport/snitch_test/TestGossipingPropertyFileSnitch/test_prefer_local_reconnect_on_listen_address
> {code}
> {novnode}
> Error Message
> Error from server: code=2200 [Invalid query] message="keyspace keyspace1 does 
> not exist"
> -------------------- >> begin captured logging << --------------------
> dtest: DEBUG: cluster ccm directory: /tmp/dtest-k6b0iF
> dtest: DEBUG: Done setting configuration options:
> {   'initial_token': None,
>     'num_tokens': '32',
>     'phi_convict_threshold': 5,
>     'range_request_timeout_in_ms': 10000,
>     'read_request_timeout_in_ms': 10000,
>     'request_timeout_in_ms': 10000,
>     'truncate_request_timeout_in_ms': 10000,
>     'write_request_timeout_in_ms': 10000}
> cassandra.policies: INFO: Using datacenter 'dc1' for DCAwareRoundRobinPolicy 
> (via host '127.0.0.1'); if incorrect, please specify a local_dc to the 
> constructor, or limit contact points to local cluster nodes
> cassandra.cluster: INFO: New Cassandra host <Host: 127.0.0.1 dc1> discovered
> --------------------- >> end captured logging << ---------------------
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
>     testMethod()
>   File "/home/automaton/cassandra-dtest/snitch_test.py", line 87, in 
> test_prefer_local_reconnect_on_listen_address
>     new_rows = list(session.execute("SELECT * FROM {}".format(stress_table)))
>   File "/home/automaton/src/cassandra-driver/cassandra/cluster.py", line 
> 1998, in execute
>     return self.execute_async(query, parameters, trace, custom_payload, 
> timeout, execution_profile, paging_state).result()
>   File "/home/automaton/src/cassandra-driver/cassandra/cluster.py", line 
> 3784, in result
>     raise self._final_exception
> 'Error from server: code=2200 [Invalid query] message="keyspace keyspace1 
> does not exist"\n-------------------- >> begin captured logging << 
> --------------------\ndtest: DEBUG: cluster ccm directory: 
> /tmp/dtest-k6b0iF\ndtest: DEBUG: Done setting configuration options:\n{   
> \'initial_token\': None,\n    \'num_tokens\': \'32\',\n    
> \'phi_convict_threshold\': 5,\n    \'range_request_timeout_in_ms\': 10000,\n  
>   \'read_request_timeout_in_ms\': 10000,\n    \'request_timeout_in_ms\': 
> 10000,\n    \'truncate_request_timeout_in_ms\': 10000,\n    
> \'write_request_timeout_in_ms\': 10000}\ncassandra.policies: INFO: Using 
> datacenter \'dc1\' for DCAwareRoundRobinPolicy (via host \'127.0.0.1\'); if 
> incorrect, please specify a local_dc to the constructor, or limit contact 
> points to local cluster nodes\ncassandra.cluster: INFO: New Cassandra host 
> <Host: 127.0.0.1 dc1> discovered\n--------------------- >> end captured 
> logging << ---------------------'
> {novnode}
> {code}


