[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370969#comment-16370969 ]

Joseph Lynch commented on CASSANDRA-13993:
------------------------------------------

[~jasobrown] this is a great idea and would definitely help with the pain of 
rolling restarts! I'm curious, though, about the choice to make this a 
percentage instead of a raw count, either for only the local datacenter or 
for each datacenter (e.g. block until N or fewer nodes are marked down in 
this node's local datacenter). In particular, I'm concerned that in typical 
setups (say 2 datacenters, <60 nodes, RF=3, mostly 
{{NetworkTopologyStrategy}} keyspaces) having more than one node down in 
gossip in the same datacenter means a high probability of unavailable 
exceptions at {{LOCAL_QUORUM}} or timeouts, especially on vnode=256 clusters, 
where any 2 nodes down in different racks essentially guarantee an 
unavailable error for some intersecting token range.
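To make the vnode intuition concrete: under idealized random replica 
placement (which real rack-aware placement only approximates), a cluster of N 
nodes with 256 vnodes each has roughly 256 * N token ranges, and a specific 
pair of nodes both replicate a given RF=3 range with probability 
6 / (N * (N - 1)), so the expected number of co-replicated ranges is about 
1536 / (N - 1), well above 1 for N < 60. A back-of-envelope sketch:

{code:java}
// Back-of-envelope (not from the ticket): expected number of RF=3 token
// ranges that two specific nodes co-replicate, assuming idealized random
// replica placement. Anything above ~1 means two down nodes are expected
// to leave some range without LOCAL_QUORUM.
public class QuorumLossSketch
{
    public static void main(String[] args)
    {
        int vnodes = 256;
        for (int n : new int[] { 12, 30, 60, 120 })
        {
            double ranges = (double) n * vnodes;          // total token ranges
            double pPair = 6.0 / ((double) n * (n - 1));  // C(N-2,1) / C(N,3)
            System.out.printf("N=%3d: expected co-replicated ranges = %.1f%n",
                              n, ranges * pPair);
        }
    }
}
{code}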

What if, instead of a percentage, the system waited until a fixed number (or 
fewer) of endpoints are marked down in the local datacenter, such as 1 by 
default, relying on the timeout for large clusters? (It would be awesome if 
this timeout re-used or defaulted to an existing timeout relevant to gossip 
convergence, such as {{BROADCAST_INTERVAL}} or {{RING_DELAY}}.)
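Roughly, the gate I have in mind looks like the sketch below 
({{awaitPeers}} and {{downInLocalDc}} are hypothetical names, not proposed 
API; the supplier stands in for however the daemon counts {{DOWN}} endpoints 
in its own datacenter):

{code:java}
import java.util.function.IntSupplier;

// Hypothetical sketch of a fixed-count startup gate: block until at most
// maxDown endpoints in the local DC are still marked down, or until the
// timeout elapses. The IntSupplier stands in for however the daemon counts
// DOWN endpoints in its own datacenter (e.g. from gossip state).
public final class StartupGate
{
    public static void awaitPeers(IntSupplier downInLocalDc, int maxDown,
                                  long timeoutMillis) throws InterruptedException
    {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (downInLocalDc.getAsInt() > maxDown
               && System.currentTimeMillis() < deadline)
        {
            Thread.sleep(1000); // re-check roughly once per gossip round
        }
    }
}
{code}

The native protocol port would only be bound after {{awaitPeers}} returns, 
whether because the down count dropped or because the timeout expired.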

What do you think? I worked up a quick proof-of-concept implementation that 
supports counts for the local DC, each DC, or all DCs (for users running 
{{LOCAL_QUORUM}} vs {{EACH_QUORUM}} vs {{QUORUM}}) over on 
[github|https://github.com/jasobrown/cassandra/compare/13993...jolynch:13993] 
to show roughly what I'm thinking. I didn't fix the unit tests, but if you 
think it's a good idea I can fix them up and add some more (that test 
multi-DC setups).
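For illustration only (a sketch, not the code on that branch), the 
bookkeeping behind the three modes could look like this, with the 
endpoint-to-datacenter mapping assumed to come from the snitch:

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the three counting modes (local DC / each DC /
// all DCs); it does not claim to match the linked PoC branch.
public final class DownCounts
{
    // Group down endpoints by datacenter; dcOf is assumed to come from
    // the snitch.
    public static Map<String, Integer> downByDc(Set<String> downEndpoints,
                                                Map<String, String> dcOf)
    {
        Map<String, Integer> counts = new HashMap<>();
        for (String ep : downEndpoints)
            counts.merge(dcOf.get(ep), 1, Integer::sum);
        return counts;
    }

    // Local DC mode (LOCAL_QUORUM): only this node's DC must be healthy.
    public static boolean localDcOk(Map<String, Integer> down, String localDc,
                                    int maxDown)
    {
        return down.getOrDefault(localDc, 0) <= maxDown;
    }

    // Each DC mode (EACH_QUORUM): every DC must independently be healthy.
    public static boolean eachDcOk(Map<String, Integer> down, int maxDown)
    {
        return down.values().stream().allMatch(c -> c <= maxDown);
    }

    // All DCs mode (QUORUM): the cluster-wide down count must be small.
    public static boolean allDcsOk(Map<String, Integer> down, int maxDown)
    {
        return down.values().stream().mapToInt(Integer::intValue).sum() <= maxDown;
    }
}
{code}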

I guess that to make it even smarter, a previous {{CassandraDaemon.stop}} 
could persist how many {{DOWN}} nodes there were to a local table or some 
such, and then {{CassandraDaemon.start}} could wait using the maximum of that 
persisted number and the configured default; but that adds more complexity, 
and given the flexibility of the three counts I am not sure it's worth it.
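If it ever did seem worth it, the mechanics are small; a sketch (with a flat 
file as a hypothetical stand-in for "a local table or some such"):

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the persist-and-max idea; the file path and hook
// names are illustrative, not proposed API.
public final class PersistedDownCount
{
    static final Path STATE = Path.of("down_count_at_stop");

    // From a stop hook: remember how many peers were DOWN at shutdown.
    static void onStop(int downNow) throws IOException
    {
        Files.writeString(STATE, Integer.toString(downNow));
    }

    // From a start hook: tolerate at least as many DOWN peers as were
    // already down when we stopped, never fewer than the configured default.
    static int thresholdAtStart(int configuredDefault) throws IOException
    {
        int persisted = Files.exists(STATE)
                      ? Integer.parseInt(Files.readString(STATE).trim())
                      : 0;
        return Math.max(persisted, configuredDefault);
    }
}
{code}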

> Add optional startup delay to wait until peers are ready
> --------------------------------------------------------
>
>                 Key: CASSANDRA-13993
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13993
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Lifecycle
>            Reporter: Jason Brown
>            Assignee: Jason Brown
>            Priority: Minor
>             Fix For: 4.x
>
>
> When bouncing a node in a large cluster, it can take a while for the node to 
> recognize the rest of the cluster as available. This is especially true when 
> using TLS on internode messaging connections. The bouncing node (and any 
> clients connected to it) may see a series of Unavailable or Timeout 
> exceptions until the node is 'warmed up', as connecting to the rest of the 
> cluster happens asynchronously from the rest of the startup process.
> There are two aspects that drive a node's ability to successfully communicate 
> with a peer after a bounce:
> - marking the peer as 'alive' (state that is held in gossip). This affects 
> unavailable exceptions.
> - having both outbound and inbound connections open and ready to each peer. 
> This affects timeouts.
> Details of each of these mechanisms are described in the comments below.
> This ticket proposes adding a mechanism, optional and configurable, to delay 
> opening the client native protocol port until some percentage of the peers in 
> the cluster are marked alive and connected to/from. Thus, while we potentially 
> slow down startup (by delaying the opening of the client port), we reduce the 
> chance that queries made by clients hit transient unavailable/timeout 
> exceptions.


