[ 
https://issues.apache.org/jira/browse/CASSANDRA-7816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368543#comment-14368543
 ] 

Stefania commented on CASSANDRA-7816:
-------------------------------------

Unfortunately I could not reproduce it. I tried with ccm clusters of 5, 10 and 
15 nodes. I was not sure whether you were doing any specific operations, so I 
tried restarting and adding nodes, but the status reported by nodetool on all 
hosts was always correct ("UN").

However, by code inspection, the problem described could happen if a node fails 
to reply to an echo message after it has gossiped its status as alive. Maybe 
the socket listening thread was slow to start due to machine overload or maybe 
some other cluster properties could explain it. I did not waste too much time 
trying to understand what I could not reproduce. Instead, I went ahead and 
created a new delta patch that should fix it: 
https://github.com/stef1927/cassandra/tree/7816-2.

What this patch does is revert the part of the code that I think is 
causing the issue in favor of a more conservative approach that simply stores 
the last state reported to the client in {{Server.EventNotifier}} and does not 
interfere with the existing {{markAlive()}} logic in {{Gossiper}}. It does 
introduce a new problem, in that the additional map may consume extra memory, 
but we can worry about this during code review if this patch works.
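
To illustrate the idea, here is a minimal sketch of a "last reported state" 
filter. It is not the code in the patch: the class name {{LastStatusFilter}}, 
the {{Status}} enum and the method names are hypothetical, and the {{forget()}} 
hook is only one assumed way of keeping the additional map from growing.

```java
import java.net.InetAddress;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper, not part of Cassandra: remembers the last status
// pushed to clients for each endpoint so that repeats can be suppressed.
class LastStatusFilter
{
    enum Status { UP, DOWN }

    // The "additional map": one entry per endpoint ever reported.
    private final ConcurrentMap<InetAddress, Status> lastReported =
        new ConcurrentHashMap<InetAddress, Status>();

    // Returns true only when the status differs from the one last reported
    // for this endpoint, i.e. when an event should be pushed to clients.
    boolean shouldNotify(InetAddress endpoint, Status status)
    {
        Status previous = lastReported.put(endpoint, status);
        return previous != status;
    }

    // Assumed cleanup hook: drop the entry when a node leaves the cluster.
    void forget(InetAddress endpoint)
    {
        lastReported.remove(endpoint);
    }
}
```

With a filter like this, a second UP for an endpoint that was already reported 
as UP is dropped, while a DOWN followed by an UP still produces two events.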

So would you mind applying this new patch and seeing if it solves it? If not, 
could you please give me more information on your cluster or give me access to 
it? You could also try to reproduce it in TRACE mode and send me the logs.

> Duplicate DOWN/UP Events Pushed with Native Protocol
> ----------------------------------------------------
>
>                 Key: CASSANDRA-7816
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7816
>             Project: Cassandra
>          Issue Type: Bug
>          Components: API
>            Reporter: Michael Penick
>            Assignee: Stefania
>            Priority: Minor
>             Fix For: 2.1.4, 2.0.14
>
>         Attachments: 7816-v2.0.txt, tcpdump_repeating_status_change.txt, 
> trunk-7816.txt
>
>
> Added "MOVED_NODE" as a possible type of topology change and also specified 
> that it is possible to receive the same event multiple times.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
