[ https://issues.apache.org/jira/browse/CASSANDRA-12781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593950#comment-15593950 ]
Stefania commented on CASSANDRA-12781: -------------------------------------- The healthy DOWN and UP notifications are all separated by just over 20 seconds. When we receive the extra DOWN notification, the UP notification follows within less than a second from the DOWN notification: {code} dtest: DEBUG: Got {'change_type': u'NEW_NODE', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:16:57.887774 dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:16:57.888112 dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:16:57.934182 dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:17:18.734717 dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:17:18.810279 dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:17:38.922725 dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:17:38.983538 dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:17:59.823012 dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:17:59.861144 dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:18:20.984716 dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:18:21.038949 dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:18:21.986869 dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 127.0.0.1 at 2016-10-05 04:18:33.047814 {code} So the problem is a bogus UP notification, which was caused by a GOSSIP message that caused {{realMarkAlive()}} to be called (note the {{is now UP}} log message, which is logged by {{realMarkAlive()}}: {code} INFO [GossipStage:1] 2016-10-05 04:18:21,037 Gossiper.java:1034 - InetAddress /127.0.0.2 is now DOWN DEBUG [GossipStage:1] 2016-10-05 04:18:21,037 OutboundTcpConnection.java:179 - Enqueuing socket close for /127.0.0.2 DEBUG [GossipStage:1] 2016-10-05 04:18:21,037 OutboundTcpConnection.java:179 - Enqueuing socket close for /127.0.0.2 DEBUG [GossipStage:1] 2016-10-05 04:18:21,037 OutboundTcpConnection.java:179 - Enqueuing socket close for /127.0.0.2 DEBUG [MessagingService-Outgoing-/127.0.0.2-Small] 2016-10-05 04:18:21,038 OutboundTcpConnection.java:373 - Socket to /127.0.0.2 closed DEBUG [GossipStage:1] 2016-10-05 04:18:21,038 FailureDetector.java:323 - Forcing conviction of /127.0.0.2 DEBUG [MessagingService-Outgoing-/127.0.0.2-Gossip] 2016-10-05 04:18:21,039 OutboundTcpConnection.java:373 - Socket to /127.0.0.2 closed DEBUG [MessagingService-Outgoing-/127.0.0.2-Gossip] 2016-10-05 04:18:21,039 OutboundTcpConnection.java:387 - Attempting to connect to /127.0.0.2 INFO [HANDSHAKE-/127.0.0.2] 2016-10-05 04:18:21,045 OutboundTcpConnection.java:512 - Handshaking version with /127.0.0.2 DEBUG [MessagingService-Outgoing-/127.0.0.2-Gossip] 2016-10-05 04:18:21,046 OutboundTcpConnection.java:481 - Done connecting to /127.0.0.2 DEBUG [RequestResponseStage-1] 2016-10-05 04:18:21,984 Gossiper.java:1018 - removing expire time for endpoint : /127.0.0.2 INFO [RequestResponseStage-1] 2016-10-05 04:18:21,984 Gossiper.java:1019 - InetAddress /127.0.0.2 is now UP {code} Since CASSANDRA-8236, {{realMarkAlive()}} won't generate an UP notification if {{RPC_READY}} is false (see {{SS.notifyUp()}}). Therefore my guess is that the problem is that {{RPC_READY}} is only set to false in {{CassandraDaemon.stop()}}, not when the RPC services are stopped by {{SS.drain()}}, which is in {{CassandraDaemon.shutdownClientServers()}}. My plan is to see how easily we can reproduce this in a multiplexed run, and if we can reproduce it then we can test the assumption above. If we cannot reproduce it, we should still set {{RPC_READY}} to false in {{CassandraDaemon.shutdownClientServers()}}, I think this should be safe even for 3.0, and is the correct thing to do. > dtest failure in > pushed_notifications_test.TestPushedNotifications.restart_node_test > ------------------------------------------------------------------------------------ > > Key: CASSANDRA-12781 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12781 > Project: Cassandra > Issue Type: Bug > Reporter: Sean McCarthy > Assignee: Stefania > Labels: dtest > Fix For: 3.x > > Attachments: node1.log, node1_debug.log, node1_gc.log, node2.log, > node2_debug.log, node2_gc.log > > > example failure: > http://cassci.datastax.com/job/cassandra-3.X_dtest/4/testReport/pushed_notifications_test/TestPushedNotifications/restart_node_test > {code} > Error Message > [{'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)}, {'change_type': > u'UP', 'address': ('127.0.0.2', 9042)}, {'change_type': u'DOWN', 'address': > ('127.0.0.2', 9042)}] > {code} > {code} > Stacktrace > File "/usr/lib/python2.7/unittest/case.py", line 329, in run > testMethod() > File "/home/automaton/cassandra-dtest/pushed_notifications_test.py", line > 181, in restart_node_test > self.assertEquals(expected_notifications, len(notifications), > notifications) > File "/usr/lib/python2.7/unittest/case.py", line 513, in assertEqual > assertion_func(first, second, msg=msg) > File "/usr/lib/python2.7/unittest/case.py", line 506, in _baseAssertEqual > raise self.failureException(msg) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)