[ https://issues.apache.org/jira/browse/CASSANDRA-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021191#comment-14021191 ]

Sylvain Lebresne commented on CASSANDRA-7012:
---------------------------------------------

I think the proper fix for this is really to have the notification sent when 
everything is ready, though I don't remember why that was not entirely trivial 
(but it does sound to me like having a point in the code where we know that a 
node is ready to process queries is something we should have).

That said, I do think we should persist UP/DOWN states in the system tables for 
other reasons. For example, there is currently a nasty case where clients can 
get an unavailable exception for queries without good reason: when a node (or a 
small number of nodes) gets isolated from the rest of the cluster for some time 
without being dead. Once the network is re-established, other nodes might mark 
the isolated node UP before the isolated node has marked the rest of the 
cluster UP. In that case, the driver can get a notification that the 
once-isolated node is now back and connect to it (which will succeed). There is 
then a window of time where, if the node is used for querying, it will return 
unavailable because it still thinks all the other nodes (or at least a large 
part of them) are down. I'll note that this is not really hard to reproduce.

If we were to persist up/down states, the driver could implement something 
along the lines of: if a new node is advertised as UP, check its local view of 
the cluster. If that view contradicts too much of what we know, then stall a 
bit before using that node for queries.
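
To sketch it (purely illustrative: the "up" column below does not exist today, 
and I'm assuming the SELECT is routed to the newly-UP node):

{code:java}
// Illustrative only: assumes a hypothetical boolean "up" column in
// system.peers persisting this node's view of each peer's liveness.
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class StaleViewCheck {

    // True if the newly-UP node's view of the cluster roughly agrees with
    // ours; queries here are assumed to be routed to that node.
    static boolean viewLooksSane(Session session, int upCountWeSee) {
        int upCountItSees = 0;
        for (Row row : session.execute("SELECT peer, up FROM system.peers")) {
            if (row.getBool("up")) {          // hypothetical column
                upCountItSees++;
            }
        }
        // "contradicts too much": tolerate only small disagreement
        return upCountItSees >= upCountWeSee / 2;
    }

    // Stall until the node's view converges before sending it queries.
    static void stallUntilUsable(Session session, int upCountWeSee)
            throws InterruptedException {
        while (!viewLooksSane(session, upCountWeSee)) {
            Thread.sleep(1000);
        }
    }
}
{code}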

Besides, I do think having simple access to how each node views the cluster 
can be useful for other tools. So unless we have a technical reason not to 
expose those states, I am in favor of doing it, even if that's somewhat 
orthogonal to making notifications be sent at the proper time.
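
And on the original ask below: with such a persisted flag, say a hypothetical 
boolean native_transport_active in system.local written at the very end of 
startup, driver tests could poll instead of sleeping:

{code:java}
// Illustrative only: "native_transport_active" is a hypothetical
// system.local column written once the native protocol is serving.
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class ReadinessPoll {

    // Short polls instead of fixed 40-second sleeps in driver tests.
    static void awaitReady(Session session) throws InterruptedException {
        while (true) {
            Row row = session.execute(
                    "SELECT native_transport_active FROM system.local").one();
            if (row != null && row.getBool("native_transport_active")) {
                return;                       // node is fully up and serving
            }
            Thread.sleep(500);
        }
    }
}
{code}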

> Expose node status through the system tables, especially after the native 
> protocol is active
> --------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7012
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7012
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Joaquin Casares
>            Assignee: Tyler Hobbs
>              Labels: datastax_qa
>
> Java-Driver's note on the issue: 
> https://github.com/datastax/java-driver/blob/2.1/driver-core/src/main/java/com/datastax/driver/core/Cluster.java#L1087
> What the tests for the drivers (Java, Python, C#, etc.) see is a need for 
> sleeps to cover the race condition between when isUp() returns true and when 
> the nodes are actually ready to be accessed.
> Could we instead, at the very end of the startup process when the native 
> protocol is up and active, have the system tables be written to denote that 
> the node is now UP and active?
> If writing to the system tables is not the best idea, could we figure out 
> another solution to get rid of this race condition, thereby simplifying the 
> testing of the drivers and removing test cases riddled with sleeps of up to 
> 40 seconds?


