[
https://issues.apache.org/jira/browse/CASSANDRA-7816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368543#comment-14368543
]
Stefania commented on CASSANDRA-7816:
-------------------------------------
Unfortunately I could not reproduce it. I tried ccm clusters of 5, 10 and
15 nodes. I was not sure whether you were doing any specific operations, so I
tried restarting and adding nodes, but the status reported by nodetool, on all
hosts, was always correct as "UN".
However, by code inspection, the problem described could happen if a node fails
to reply to an echo message after it has gossiped its status as alive. Maybe
the socket listening thread was slow to start due to machine overload or maybe
some other cluster properties could explain it. I did not waste too much time
trying to understand what I could not reproduce. Instead, I went ahead and
created a new delta patch that should fix it:
https://github.com/stef1927/cassandra/tree/7816-2.
What this patch does is revert the part of the code that I think is
causing the issue in favor of a more conservative approach that simply stores
the last state reported to the client in {{Server.EventNotifier}} and does not
interfere with the existing {{markAlive()}} logic in {{Gossiper}}. It does
introduce a new problem, in that the additional map may consume extra memory,
but we can worry about this during code review if this patch works.
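To illustrate the idea only (this is not the code in the patch; the class and method names below are hypothetical and are not taken from {{Server.EventNotifier}}), a minimal sketch of remembering the last status reported per endpoint and suppressing repeats could look like this:

```java
import java.net.InetAddress;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper: remembers the last status pushed to clients for each
// endpoint so that a repeated UP or DOWN is not sent a second time.
final class LastStatusTracker {
    public enum Status { UP, DOWN }

    // One entry per endpoint ever reported; this is the extra memory cost.
    private final ConcurrentMap<InetAddress, Status> lastReported =
            new ConcurrentHashMap<>();

    // Records the new status and returns true only if it differs from the
    // status previously recorded for this endpoint (or if none was recorded).
    public boolean shouldNotify(InetAddress endpoint, Status status) {
        Status previous = lastReported.put(endpoint, status);
        return previous != status;
    }

    // Drops the entry, e.g. when a node leaves the cluster, to bound the map.
    public void forget(InetAddress endpoint) {
        lastReported.remove(endpoint);
    }
}
```

In this sketch the caller would push an event to clients only when {{shouldNotify()}} returns true, and the gossip logic itself is left untouched. Calling something like {{forget()}} when a node is removed is one possible way to limit the memory used by the map.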
So would you mind applying this new patch and seeing whether it solves the
problem? If not, could you please give me more information on your cluster or
give me access to it? You could also try to reproduce it in TRACE mode and send
me the logs.
> Duplicate DOWN/UP Events Pushed with Native Protocol
> ----------------------------------------------------
>
> Key: CASSANDRA-7816
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7816
> Project: Cassandra
> Issue Type: Bug
> Components: API
> Reporter: Michael Penick
> Assignee: Stefania
> Priority: Minor
> Fix For: 2.1.4, 2.0.14
>
> Attachments: 7816-v2.0.txt, tcpdump_repeating_status_change.txt,
> trunk-7816.txt
>
>
> Added "MOVED_NODE" as a possible type of topology change and also specified
> that it is possible to receive the same event multiple times.