[
https://issues.apache.org/jira/browse/CASSANDRA-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021191#comment-14021191
]
Sylvain Lebresne commented on CASSANDRA-7012:
---------------------------------------------
I think the proper fix for this is really to have the notification sent when
everything is ready, though I don't remember why that was not entirely trivial
(but it sounds to me that having a point in the code where we know that a node
is ready to process queries is something we should have).
That said, I do think we should persist UP/DOWN states in the system tables for
other reasons. For example, there is currently a nasty case where clients can
get an unavailable exception for queries without good reason: if a node (or a
small number of nodes) gets isolated from the rest of the cluster for some time
without being dead, then once the network is re-established, other nodes might
mark the isolated node UP before the isolated node has marked the rest of the
cluster UP. In that case, the driver can get a notification that the
once-isolated node is now back and could connect to it (which will succeed).
There is then a window of time where, if the node is used for querying, it will
return unavailable because it still thinks all the other nodes (or at least a
large part of them) are down. I'll note that this is not really hard to
reproduce.
If we were to persist up/down states, the driver could implement something
along the lines of: if a new node is advertised as UP, check its local view of
the cluster. If that view contradicts too much of what we know, then stall a
bit before using that node for queries.
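As a rough illustration, assuming a hypothetical table
system.peer_states(peer inet PRIMARY KEY, is_up boolean) that persists each
node's gossip-derived UP/DOWN states (the table name, columns and threshold
below are made up for the example, not an existing schema), the driver-side
check could look something like:
{code:java}
import java.net.InetAddress;
import java.util.HashSet;
import java.util.Set;

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class PeerViewCheck {

    // Fraction of the peers we consider UP that the newly-UP node must
    // also consider UP before we route queries to it (illustrative value).
    private static final double AGREEMENT_THRESHOLD = 0.8;

    /**
     * Returns true if the newly-UP node's view of the cluster agrees
     * "enough" with ours, by reading the hypothetical persisted states.
     */
    static boolean viewAgrees(Session sessionToNewNode, Set<InetAddress> peersWeSeeUp) {
        ResultSet rs = sessionToNewNode.execute(
                "SELECT peer, is_up FROM system.peer_states");
        Set<InetAddress> peersItSeesUp = new HashSet<InetAddress>();
        for (Row row : rs)
            if (row.getBool("is_up"))
                peersItSeesUp.add(row.getInet("peer"));

        int agreed = 0;
        for (InetAddress peer : peersWeSeeUp)
            if (peersItSeesUp.contains(peer))
                agreed++;

        // If the views contradict too much, the caller should stall before
        // sending queries to this node rather than risk unavailable errors.
        return peersWeSeeUp.isEmpty()
                || (double) agreed / peersWeSeeUp.size() >= AGREEMENT_THRESHOLD;
    }
}
{code}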
Besides, I do think having simple access to how each node views the cluster can
be useful for other tools. So unless we have a technical reason not to expose
those states, I am in favor of doing it, even if that's somewhat orthogonal to
making notifications be sent at the proper time.
> Expose node status through the system tables, especially after the native
> protocol is active
> --------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-7012
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7012
> Project: Cassandra
> Issue Type: Improvement
> Components: API
> Reporter: Joaquin Casares
> Assignee: Tyler Hobbs
> Labels: datastax_qa
>
> Java-Driver's note on the issue:
> https://github.com/datastax/java-driver/blob/2.1/driver-core/src/main/java/com/datastax/driver/core/Cluster.java#L1087
> What the tests for the drivers (Java, Python, C#, etc.) see is a need for
> sleeps to cover the race condition between when isUp() returns true and when
> the nodes are actually ready to be accessed.
> Could we instead, at the very end of the startup process when the native
> protocol is up and active, have the system tables be written to denote that
> the node is now UP and active?
> If writing to the system tables is not the best idea, could we figure out
> another solution to get rid of this race condition, thereby simplifying the
> testing of the drivers and removing test cases riddled with sleeps of up to
> 40 seconds?
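> Assuming such a flag existed (say, a hypothetical boolean column
> native_transport_active on system.local; the column name is purely
> illustrative, not an existing one), a driver test could poll for readiness
> instead of sleeping:
> {code:java}
> import com.datastax.driver.core.Cluster;
> import com.datastax.driver.core.Row;
> import com.datastax.driver.core.Session;
>
> public class WaitForNodeReady {
>
>     /**
>      * Polls the hypothetical readiness flag until the node reports the
>      * native protocol as active, rather than sleeping a fixed 40 seconds.
>      */
>     static void waitUntilReady(String contactPoint, long timeoutMillis)
>             throws InterruptedException {
>         long deadline = System.currentTimeMillis() + timeoutMillis;
>         while (System.currentTimeMillis() < deadline) {
>             try (Cluster cluster = Cluster.builder()
>                     .addContactPoint(contactPoint).build();
>                  Session session = cluster.connect()) {
>                 Row row = session.execute(
>                         "SELECT native_transport_active FROM system.local").one();
>                 if (row != null && row.getBool("native_transport_active"))
>                     return; // node is UP and serving the native protocol
>             } catch (Exception e) {
>                 // node not accepting native protocol connections yet
>             }
>             Thread.sleep(500); // short poll interval instead of a blind sleep
>         }
>         throw new IllegalStateException("node never became ready");
>     }
> }
> {code}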