[jira] [Commented] (CASSANDRA-10052) Bringing one node down, makes the whole cluster go down for a second

Stefania (JIRA) Tue, 01 Sep 2015 23:26:32 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726830#comment-14726830 ]


Stefania commented on CASSANDRA-10052:
--------------------------------------

Thanks for your suggestion, [~thobbs]. I now suppress the status change 
(UP/DOWN) and topology change (MOVE/JOIN/LEAVE) notifications when the 
endpoint's broadcast rpc address equals our {{broadcast_rpc_address}} but the 
endpoint is not the local node. There is no need to check {{rpc_address}} as 
well: the broadcast rpc address defaults to the rpc address when the user does 
not specify it, and it is the broadcast rpc address that is sent over Gossip. 
The relevant code is in {{SS.getRpcAddress()}}.

The code looks something like this:

{code}
if (!endpoint.equals(FBUtilities.getBroadcastAddress()) &&
    event.nodeAddress().equals(DatabaseDescriptor.getBroadcastRpcAddress()))
    return;
{code}

where {{event.nodeAddress()}} is the broadcast rpc address of the endpoint, as 
returned by {{Server.getRpcAddress(endpoint)}}, which is unchanged.
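
To make the condition easier to reason about, here is a minimal, self-contained sketch of the same check. It is not the actual patch: the class {{NotificationFilter}}, the method {{shouldSuppress}} and the helper {{ip}} are hypothetical names, and the local addresses are passed in as parameters instead of being read from {{FBUtilities}} and {{DatabaseDescriptor}}.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical stand-alone illustration; not part of the Cassandra code base.
public final class NotificationFilter
{
    private NotificationFilter() {}

    /**
     * Returns true when a notification should be dropped: the endpoint is a
     * remote node, yet the address that would be sent to clients is
     * indistinguishable from our own broadcast rpc address.
     */
    public static boolean shouldSuppress(InetAddress endpoint,
                                         InetAddress localBroadcastAddress,
                                         InetAddress eventNodeAddress,
                                         InetAddress localBroadcastRpcAddress)
    {
        return !endpoint.equals(localBroadcastAddress)
               && eventNodeAddress.equals(localBroadcastRpcAddress);
    }

    // Builds an IPv4 address from literal octets without any DNS lookup.
    public static InetAddress ip(int a, int b, int c, int d)
    {
        try
        {
            return InetAddress.getByAddress(new byte[]{ (byte) a, (byte) b, (byte) c, (byte) d });
        }
        catch (UnknownHostException e)
        {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args)
    {
        InetAddress local = ip(192, 168, 101, 2);    // assumed local broadcast address
        InetAddress remote = ip(192, 168, 101, 1);   // assumed remote endpoint
        InetAddress loopback = ip(127, 0, 0, 1);     // rpc_address left at localhost

        // Remote endpoint whose rpc address collides with ours.
        System.out.println(shouldSuppress(remote, local, loopback, loopback));
        // The local node itself.
        System.out.println(shouldSuppress(local, local, loopback, loopback));
        // Remote endpoint with its own distinct rpc address.
        System.out.println(shouldSuppress(remote, local, remote, loopback));
    }
}
```

Under these assumptions only the first case is suppressed: the remote endpoint resolves to the same rpc address as the local node, so pushing the event would make clients believe the node they are connected to went down.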

CI results for 2.1 will appear here:

http://cassci.datastax.com/job/stef1927-10052-2.1-testall/
http://cassci.datastax.com/job/stef1927-10052-2.1-dtest/

Do you have time to review or would you like to suggest someone else?

If this solution is OK, I will then merge it into the 2.2+ branches and run CI there.

> Bringing one node down, makes the whole cluster go down for a second
> --------------------------------------------------------------------
>
>                 Key: CASSANDRA-10052
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10052
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sharvanath Pathak
>            Assignee: Stefania
>              Labels: client-impacting
>             Fix For: 2.1.x, 2.2.x
>
>
> When a node goes down, the other nodes learn about it through gossip.
> I do see the log message from (Gossiper.java):
> {code}
> private void markDead(InetAddress addr, EndpointState localState)
>    {
>        if (logger.isTraceEnabled())
>            logger.trace("marking as down {}", addr);
>        localState.markDead();
>        liveEndpoints.remove(addr);
>        unreachableEndpoints.put(addr, System.nanoTime());
>        logger.info("InetAddress {} is now DOWN", addr);
>        for (IEndpointStateChangeSubscriber subscriber : subscribers)
>            subscriber.onDead(addr, localState);
>        if (logger.isTraceEnabled())
>            logger.trace("Notified " + subscribers);
>    }
> {code}
> which writes "InetAddress 192.168.101.1 is now DOWN" to Cassandra's system 
> log.
> On all the other nodes, the client side (Java driver) then says "Cannot 
> connect to any host, scheduling retry in 1000 milliseconds". The clients 
> eventually reconnect, but some queries fail during this intermediate period.
> It seems to me that when the server pushes the nodeDown event, it calls 
> getRpcAddress(endpoint) and thus sends localhost as the argument in the 
> nodeDown event, as in org.apache.cassandra.transport.Server.java:
> {code}
>        public void onDown(InetAddress endpoint)
>        {
>            server.connectionTracker.send(Event.StatusChange.nodeDown(getRpcAddress(endpoint), server.socket.getPort()));
>        }
> {code}
> getRpcAddress returns localhost for any endpoint if cassandra.yaml uses 
> localhost as the rpc_address (which, by the way, is the default).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
