[ 
https://issues.apache.org/jira/browse/CASSANDRA-12781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593950#comment-15593950
 ] 

Stefania commented on CASSANDRA-12781:
--------------------------------------

The healthy DOWN and UP notifications are all separated by just over 20 
seconds. When we receive the extra DOWN notification, the UP notification 
follows within less than a second from the DOWN notification:

{code}
dtest: DEBUG: Got {'change_type': u'NEW_NODE', 'address': ('127.0.0.2', 9042)} 
from 127.0.0.1 at 2016-10-05 04:16:57.887774
dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 
127.0.0.1 at 2016-10-05 04:16:57.888112

dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 
127.0.0.1 at 2016-10-05 04:16:57.934182
dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 
127.0.0.1 at 2016-10-05 04:17:18.734717

dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 
127.0.0.1 at 2016-10-05 04:17:18.810279
dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 
127.0.0.1 at 2016-10-05 04:17:38.922725

dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 
127.0.0.1 at 2016-10-05 04:17:38.983538
dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 
127.0.0.1 at 2016-10-05 04:17:59.823012

dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 
127.0.0.1 at 2016-10-05 04:17:59.861144
dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 
127.0.0.1 at 2016-10-05 04:18:20.984716

dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 
127.0.0.1 at 2016-10-05 04:18:21.038949
dtest: DEBUG: Got {'change_type': u'UP', 'address': ('127.0.0.2', 9042)} from 
127.0.0.1 at 2016-10-05 04:18:21.986869
dtest: DEBUG: Got {'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)} from 
127.0.0.1 at 2016-10-05 04:18:33.047814
{code}

So the problem is a bogus UP notification, which was caused by a GOSSIP message 
that caused {{realMarkAlive()}} to be called (note
the {{is now UP}} log message, which is logged by {{realMarkAlive()}}:

{code}
INFO  [GossipStage:1] 2016-10-05 04:18:21,037 Gossiper.java:1034 - InetAddress 
/127.0.0.2 is now DOWN
DEBUG [GossipStage:1] 2016-10-05 04:18:21,037 OutboundTcpConnection.java:179 - 
Enqueuing socket close for /127.0.0.2
DEBUG [GossipStage:1] 2016-10-05 04:18:21,037 OutboundTcpConnection.java:179 - 
Enqueuing socket close for /127.0.0.2
DEBUG [GossipStage:1] 2016-10-05 04:18:21,037 OutboundTcpConnection.java:179 - 
Enqueuing socket close for /127.0.0.2
DEBUG [MessagingService-Outgoing-/127.0.0.2-Small] 2016-10-05 04:18:21,038 
OutboundTcpConnection.java:373 - Socket to /127.0.0.2 closed
DEBUG [GossipStage:1] 2016-10-05 04:18:21,038 FailureDetector.java:323 - 
Forcing conviction of /127.0.0.2
DEBUG [MessagingService-Outgoing-/127.0.0.2-Gossip] 2016-10-05 04:18:21,039 
OutboundTcpConnection.java:373 - Socket to /127.0.0.2 closed
DEBUG [MessagingService-Outgoing-/127.0.0.2-Gossip] 2016-10-05 04:18:21,039 
OutboundTcpConnection.java:387 - Attempting to connect to /127.0.0.2
INFO  [HANDSHAKE-/127.0.0.2] 2016-10-05 04:18:21,045 
OutboundTcpConnection.java:512 - Handshaking version with /127.0.0.2
DEBUG [MessagingService-Outgoing-/127.0.0.2-Gossip] 2016-10-05 04:18:21,046 
OutboundTcpConnection.java:481 - Done connecting to /127.0.0.2
DEBUG [RequestResponseStage-1] 2016-10-05 04:18:21,984 Gossiper.java:1018 - 
removing expire time for endpoint : /127.0.0.2
INFO  [RequestResponseStage-1] 2016-10-05 04:18:21,984 Gossiper.java:1019 - 
InetAddress /127.0.0.2 is now UP
{code}

Since CASSANDRA-8236, {{realMarkAlive()}} won't generate an UP notification if 
{{RPC_READY}} is false (see {{SS.notifyUp()}}). Therefore my guess is that the 
problem is that {{RPC_READY}} is only set to false in 
{{CassandraDaemon.stop()}}, not when the RPC services are stopped by 
{{SS.drain()}}, which is in {{CassandraDaemon.shutdownClientServers()}}.

My plan is to see how easily we can reproduce this in a multiplexed run, and if 
we can reproduce it then we can test the assumption above. If we cannot 
reproduce it, we should still set {{RPC_READY}} to false in 
{{CassandraDaemon.shutdownClientServers()}}, I think this should be safe even 
for 3.0, and is the correct thing to do.

> dtest failure in 
> pushed_notifications_test.TestPushedNotifications.restart_node_test
> ------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-12781
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12781
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sean McCarthy
>            Assignee: Stefania
>              Labels: dtest
>             Fix For: 3.x
>
>         Attachments: node1.log, node1_debug.log, node1_gc.log, node2.log, 
> node2_debug.log, node2_gc.log
>
>
> example failure:
> http://cassci.datastax.com/job/cassandra-3.X_dtest/4/testReport/pushed_notifications_test/TestPushedNotifications/restart_node_test
> {code}
> Error Message
> [{'change_type': u'DOWN', 'address': ('127.0.0.2', 9042)}, {'change_type': 
> u'UP', 'address': ('127.0.0.2', 9042)}, {'change_type': u'DOWN', 'address': 
> ('127.0.0.2', 9042)}]
> {code}
> {code}
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
>     testMethod()
>   File "/home/automaton/cassandra-dtest/pushed_notifications_test.py", line 
> 181, in restart_node_test
>     self.assertEquals(expected_notifications, len(notifications), 
> notifications)
>   File "/usr/lib/python2.7/unittest/case.py", line 513, in assertEqual
>     assertion_func(first, second, msg=msg)
>   File "/usr/lib/python2.7/unittest/case.py", line 506, in _baseAssertEqual
>     raise self.failureException(msg)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to