[jira] [Commented] (CASSANDRA-10052) Bringing one node down, makes the whole cluster go down for a second

Gaurav Jain (JIRA) Tue, 29 Sep 2015 15:37:05 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936033#comment-14936033 ]


Gaurav Jain commented on CASSANDRA-10052:
-----------------------------------------

[~mshuler] Thanks for the tip. What would be the process to get this patch 
included in the official 2.1.10 release? Also, it seems this bug has not yet 
moved to the resolved state.

[~Stefania] Thanks again for putting up this patch. Since it seems the testing 
has already been done for 2.1, would it be possible for you or [~thobbs] to 
pull this into the 2.1 branches as well as the others?

> Bringing one node down, makes the whole cluster go down for a second
> --------------------------------------------------------------------
>
>                 Key: CASSANDRA-10052
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10052
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sharvanath Pathak
>            Assignee: Stefania
>              Labels: client-impacting
>             Fix For: 2.1.x
>
>
> When a node goes down, the other nodes learn of it through gossip.
> I do see the log message from markDead (Gossiper.java):
> {code}
>     private void markDead(InetAddress addr, EndpointState localState)
>     {
>         if (logger.isTraceEnabled())
>             logger.trace("marking as down {}", addr);
>         localState.markDead();
>         liveEndpoints.remove(addr);
>         unreachableEndpoints.put(addr, System.nanoTime());
>         logger.info("InetAddress {} is now DOWN", addr);
>         for (IEndpointStateChangeSubscriber subscriber : subscribers)
>             subscriber.onDead(addr, localState);
>         if (logger.isTraceEnabled())
>             logger.trace("Notified " + subscribers);
>     }
> {code}
> This writes "InetAddress 192.168.101.1 is now DOWN" to Cassandra's system 
> log.
> On all the other nodes, the client side (Java driver) then reports "Cannot 
> connect to any host, scheduling retry in 1000 milliseconds". The clients 
> eventually reconnect, but some queries fail during this intermediate period.
> It seems to me that when the server pushes the nodeDown event, it calls 
> getRpcAddress(endpoint) and thus sends localhost as the address in the 
> nodeDown event.
> See org.apache.cassandra.transport.Server.java:
> {code}
>     public void onDown(InetAddress endpoint)
>     {
>         server.connectionTracker.send(Event.StatusChange.nodeDown(getRpcAddress(endpoint), server.socket.getPort()));
>     }
> {code}
> getRpcAddress returns localhost for any endpoint if cassandra.yaml uses 
> localhost as the value of rpc_address (which, by the way, is the default).
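
The reporter's hypothesis can be illustrated with a minimal Java sketch. It is a toy model, not Cassandra's code: the class RpcAddressSketch and all of its methods are hypothetical names introduced here, and the lookup table simply stands in for whatever rpc_address each node advertises. It only shows why a nodeDown event built from an advertised address of localhost would point a client at its own loopback address rather than at the node that actually went down.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

// Illustrative model only; these names are hypothetical and are not Cassandra's classes.
final class RpcAddressSketch {
    // Maps each node's internal (gossip) address to the rpc_address it advertises.
    private final Map<InetAddress, InetAddress> advertisedRpc = new HashMap<>();

    // Builds an IPv4 address from literal octets, so no DNS lookup is involved.
    static InetAddress ip(int a, int b, int c, int d) {
        try {
            return InetAddress.getByAddress(new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
        } catch (UnknownHostException e) {
            throw new IllegalStateException(e); // not expected for a 4-byte array
        }
    }

    void advertise(InetAddress endpoint, InetAddress rpcAddress) {
        advertisedRpc.put(endpoint, rpcAddress);
    }

    // Address that this model would place in a nodeDown event for the endpoint.
    // Falls back to the endpoint itself when nothing was advertised.
    InetAddress rpcAddressFor(InetAddress endpoint) {
        InetAddress rpc = advertisedRpc.get(endpoint);
        return rpc != null ? rpc : endpoint;
    }

    // Assumed scenario: the downed node advertised localhost as its rpc_address.
    static InetAddress reportedWithLocalhostConfig() {
        RpcAddressSketch node = new RpcAddressSketch();
        InetAddress down = ip(192, 168, 101, 1);
        node.advertise(down, ip(127, 0, 0, 1));
        return node.rpcAddressFor(down);
    }

    // Assumed scenario: the downed node advertised its routable address.
    static InetAddress reportedWithRoutableConfig() {
        RpcAddressSketch node = new RpcAddressSketch();
        InetAddress down = ip(192, 168, 101, 1);
        node.advertise(down, down);
        return node.rpcAddressFor(down);
    }

    public static void main(String[] args) {
        System.out.println("localhost config: " + reportedWithLocalhostConfig().getHostAddress());
        System.out.println("routable config:  " + reportedWithRoutableConfig().getHostAddress());
    }
}
```

In the first scenario the reported address is a loopback address, which a client connected to its local node through localhost could mistake for that local node. In the second scenario the event names the node that actually went down.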



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
