[ 
https://issues.apache.org/jira/browse/CASSANDRA-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-3533:
--------------------------------------

    Fix Version/s:     (was: 1.1.4)
                   1.2

bq. the connection just hasn't established yet, but will on OTC's next attempt

Is there anything forcing a next attempt though, besides gossip (1/N chance per 
round)?
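
To put a rough number on that 1/N concern, here is a minimal sketch. It assumes a simplified model in which each gossip round picks exactly one of N peers uniformly and independently; that is an assumption made for illustration, not a description of Cassandra's actual Gossiper. The function names are hypothetical.

```python
def prob_contacted_within(rounds: int, peers: int) -> float:
    """Chance that one specific peer is picked at least once in `rounds` rounds.

    Assumed model: every round picks one of `peers` uniformly at random,
    independently of earlier rounds.
    """
    if peers < 1 or rounds < 0:
        raise ValueError("peers must be >= 1 and rounds must be >= 0")
    miss_per_round = 1.0 - 1.0 / peers
    return 1.0 - miss_per_round ** rounds


def expected_rounds_until_contact(peers: int) -> float:
    """Mean of the geometric distribution with success probability 1/peers."""
    if peers < 1:
        raise ValueError("peers must be >= 1")
    return float(peers)


if __name__ == "__main__":
    for n in (3, 10, 100):
        p = prob_contacted_within(10, n)
        print(f"N={n}: P(contacted within 10 rounds) = {p:.3f}")
```

Under this model the expected wait grows linearly with cluster size, which is why relying on gossip alone to trigger the next connection attempt looks weak for large clusters.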

bq.  Furthermore, in the case of natural, temporary partitions of this kind, 
there are some things we still want to retry instead of failing fast, like 
streaming

But you still have things like GC-based "flapping" that can cause FD to mark a 
node down over-pessimistically.  So I don't think I buy that this is an 
argument for not making FD more robust -- since we already have to deal with 
"FD is too pessimistic" for this case.

(Fundamentally though I don't think we'll get much mileage out of trying to 
second-guess FD, so I'd rather make FD as accurate as we can.  And I suspect 
that "StorageProxy uses FD-supplemented-by-X and the rest of the system uses 
normal FD" is going to cause weirdness.)

bq. we need to report new nodes in handleMajorStateChange that sends onJoin 
events, which cause the initial connection

I must be missing this -- as near as I can tell, a connection will be 
established when we try to send a message, and I don't see the "send message 
immediately on alive" [or on join] code.

bq. both of which are scary to put in a minor release

Agreed, retargeting to 1.2.
                
> TimeoutException when there is a firewall issue.
> ------------------------------------------------
>
>                 Key: CASSANDRA-3533
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3533
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Vijay
>            Assignee: Brandon Williams
>            Priority: Minor
>             Fix For: 1.2
>
>         Attachments: 3533.txt
>
>
> When one node in the cluster is not able to talk to the other DC/RAC due to 
> a firewall or network-related issue (StorageProxy calls fail), and the nodes 
> are NOT marked down because at least one node in the cluster can talk to the 
> other DC/RAC, we get a TimeoutException instead of an 
> UnavailableException.
> The problem with this:
> 1) It is hard to monitor/identify these errors.
> 2) It is hard to differentiate from the client between a bad node and a bad 
> query.
> 3) When this issue happens we have to wait for at least the RPC timeout 
> to know that the query won't succeed.
> Possible Solution: when marking a node down we might want to check if the 
> node is actually alive by trying to communicate to it? So we can be sure that 
> the node is actually alive.
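
The "Possible Solution" in the description above can be sketched roughly as follows. This is only an illustration of the idea of combining the gossip-based verdict with a direct reachability probe from the local node; it is not how Cassandra's FailureDetector or StorageProxy works. The names `can_reach` and `request_outcome`, and the use of a plain TCP connect as the probe, are assumptions.

```python
import socket


def can_reach(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout_s.

    A firewall that silently drops packets shows up here as a timeout,
    and a closed port as a refused connection; both count as unreachable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def request_outcome(fd_says_alive: bool, locally_reachable: bool) -> str:
    """Decide how a coordinator might treat a replica (hypothetical policy).

    "unavailable" means fail fast; "proceed" means send the request and
    accept that it may still time out.
    """
    if not fd_says_alive:
        return "unavailable"  # the cluster-wide view already says down
    if not locally_reachable:
        return "unavailable"  # alive for others, but not from this node
    return "proceed"
```

For example, `request_outcome(True, can_reach(host, port))` would fail fast in the firewall case described in the issue instead of waiting for the RPC timeout. Note that this supplements the failure detector with a second signal, which is the kind of divergence the comment above warns about.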

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
