[ https://issues.apache.org/jira/browse/CASSANDRA-13327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15950523#comment-15950523 ]

Sylvain Lebresne commented on CASSANDRA-13327:
----------------------------------------------

bq. so bootstrap can be resumed

Forgot we supported resuming now :)

bq. by making the read phase use an extended (RF + P + 1) / 2 quorum?

Reading from pending nodes is a very bad idea since by definition those nodes 
don't have up-to-date data.
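
A toy illustration of that hazard (entirely invented for this email; node names and data are made up, and nothing here is Cassandra code): a pending node is still streaming, so it can be missing rows the real replicas already committed, and any quorum that counts its answer can return stale data.

```python
# Invented example: two up-to-date replicas hold key "k"; a pending
# node (still streaming) does not have it yet.
replicas = {"n2": {"k": 1}, "n3": {"k": 1}}  # full replicas
pending = {"n5": {}}                          # mid-bootstrap, missing "k"

# A read quorum that accepts the pending node's answer mixes an
# up-to-date response with a stale "not found":
responses = [replicas["n2"].get("k"), pending["n5"].get("k")]
assert 1 in responses and None in responses  # stale answer inside the quorum
```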

Well, I guess things work as they do here for decently good reasons. That said, 
thinking about it, the solution from CASSANDRA-8346 may be a bit of a big 
hammer: I believe it's enough to ensure that we read from at least one replica 
that responded to PREPARE in the same Paxos round. And since we have timeouts 
on Paxos rounds, it may be possible to drastically reduce the time we consider 
a node pending for CAS, so that it's not a real problem in practice. Something 
like having a pending node move to an "almost there" state before becoming a 
true replica, staying in that state for roughly the max duration of a Paxos 
round, and then Paxos could count those "almost there" nodes in place of 
"pending" ones for PREPARE.
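
The state transition sketched above could look roughly like this. This is purely illustrative: the state names, the `JoiningNode` class, and the timeout constant are all invented here, not anything Cassandra implements.

```python
from enum import Enum

class ReplicaState(Enum):
    PENDING = 1       # still streaming; must not serve Paxos reads
    ALMOST_THERE = 2  # caught up; waiting out any in-flight Paxos rounds
    NORMAL = 3        # full replica

# Assumed upper bound on how long a Paxos round can last (e.g. the CAS
# timeout); the real value would have to come from configuration.
MAX_PAXOS_ROUND_SECONDS = 30.0

class JoiningNode:
    def __init__(self):
        self.state = ReplicaState.PENDING
        self.caught_up_at = None

    def finish_streaming(self, now):
        # Streaming done: not a full replica yet, but eligible soon.
        self.state = ReplicaState.ALMOST_THERE
        self.caught_up_at = now

    def tick(self, now):
        # Promote once every Paxos round started before the catch-up
        # point must have timed out.
        if (self.state is ReplicaState.ALMOST_THERE
                and now - self.caught_up_at >= MAX_PAXOS_ROUND_SECONDS):
            self.state = ReplicaState.NORMAL
```

The point of the dwell time is that once no Paxos round begun before the node caught up can still be in flight, counting the node for PREPARE can no longer violate the "read from a replica that saw the same round" property.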

With that said, anything Paxos-related is pretty subtle, so I'm not claiming 
this would work; one would have to look at the idea much more closely. Also, 
this probably wouldn't be a trivial change at all. And to be upfront, I'm 
unlikely to personally have cycles to devote to this in the short term. 


> Pending endpoints size check for CAS doesn't play nicely with 
> writes-on-replacement
> -----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13327
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13327
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>
> Consider this ring:
> 127.0.0.1  MR  UP    JOINING  -7301836195843364181
> 127.0.0.2  MR  UP    NORMAL   -7263405479023135948
> 127.0.0.3  MR  UP    NORMAL   -7205759403792793599
> 127.0.0.4  MR  DOWN  NORMAL   -7148113328562451251
> where 127.0.0.1 was bootstrapping for cluster expansion. Note that, due to 
> the failure of 127.0.0.4, 127.0.0.1 was stuck trying to stream from it and 
> making no progress.
> Then the down node was replaced so we had:
> 127.0.0.1  MR  UP    JOINING  -7301836195843364181
> 127.0.0.2  MR  UP    NORMAL   -7263405479023135948
> 127.0.0.3  MR  UP    NORMAL   -7205759403792793599
> 127.0.0.5  MR  UP    JOINING  -7148113328562451251
> It's confusing in the ring - the first JOINING is a genuine bootstrap, while 
> the second is a replacement. We then saw CAS unavailables (but no non-CAS 
> unavailables). I think that's because the pending-endpoints check thinks that 
> 127.0.0.5 is gaining a range when it's just replacing.
> The workaround is to kill the stuck JOINING node, but Cassandra shouldn't 
> unnecessarily fail these requests.
> It also appears that the required-participants count is bumped by 1 during a 
> host replacement, so if the replacing host fails you will get unavailables 
> and timeouts.
> This is related to the check added in CASSANDRA-8346.
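
The "bumped by 1" effect described above can be worked through numerically. A hedged sketch, assuming RF=3 and a majority-of-(natural + pending) rule in the spirit of CASSANDRA-8346; the function names are invented for this illustration:

```python
# Invented helpers sketching the participant arithmetic discussed above.
def required_participants(rf: int, pending: int) -> int:
    """Majority over natural + pending endpoints (integer division)."""
    return (rf + pending) // 2 + 1

def cas_available(live: int, rf: int, pending: int) -> bool:
    """CAS succeeds only if enough live endpoints remain."""
    return live >= required_participants(rf, pending)

# Healthy ring, no pending endpoints: a plain quorum of 2 out of 3.
assert cas_available(live=3, rf=3, pending=0)
# A replacement counts as pending, bumping the requirement from 2 to 3,
# so losing the replacing host (or any one replica) fails CAS:
assert not cas_available(live=2, rf=3, pending=1)
```

That bump is why the reported ring sees CAS unavailables while plain quorum writes, which don't add the pending endpoint to the denominator in the same way, keep succeeding.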



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
