My view is that this is a pretty serious bug. I wonder if transactional 
metadata will make it possible to safely fix this for users without rebuilding 
(only via opt-in, of course).

> On 7 Mar 2023, at 15:54, Miklosovic, Stefan <stefan.mikloso...@netapp.com> 
> wrote:
> 
> Thanks everybody for the feedback.
> 
> I think that emitting a warning upon keyspace creation (and alteration) 
> should be enough for starters. If somebody cannot live without a 100% 
> bulletproof solution, over time we might choose one of the offered approaches; 
> as the saying goes, there is no silver bullet. If we decide to implement the 
> new strategy, we would probably emit warnings on NTS anyway, but that part 
> would already be done, so only the new strategy itself would remain to be 
> provided.
> 
> ________________________________________
> From: Paulo Motta <pauloricard...@gmail.com>
> Sent: Monday, March 6, 2023 17:48
> To: dev@cassandra.apache.org
> Subject: Re: Degradation of availability when using NTS and RF > number of 
> racks
> 
> It's a bit unfortunate that NTS does not maintain the ability to lose a rack 
> without loss of quorum for RF > #racks > 2, since this can be easily achieved 
> by evenly placing replicas across all racks.
> 
> Since RackAwareTopologyStrategy is a superset of NetworkTopologyStrategy, 
> can't we just use the new correct placement logic for newly created keyspaces 
> instead of having a new strategy?
> 
> The placement logic would be backwards-compatible for RF <= #racks. On 
> upgrade, we could mark existing keyspaces with RF > #racks with 
> use_legacy_replica_placement=true to maintain backwards compatibility and log 
> a warning that the rack loss guarantee is not maintained for keyspaces 
> created before the fix. Old keyspaces with RF <= #racks would still work with 
> the new replica placement. The downside is that we would need to keep the old 
> NTS logic around, or we could eventually deprecate it and require users to 
> migrate keyspaces using the legacy placement strategy.
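> 
> To picture the upgrade path, the dispatch could be as simple as something 
> like the following standalone sketch (every name below except the 
> use_legacy_replica_placement flag itself is invented for illustration; none 
> of this exists in the codebase today):
> 
> import java.util.List;
> import java.util.Map;
> 
> public class PlacementDispatch
> {
>     // Stand-ins for the two placement implementations (invented names).
>     static List<String> legacyNtsPlacement(long token)  { return List.of("legacy NTS walk"); }
>     static List<String> rackAwarePlacement(long token)  { return List.of("even spread across racks"); }
> 
>     // Keyspaces created before the fix would be marked with the legacy flag on upgrade.
>     static List<String> placeReplicas(long token, Map<String, String> keyspaceOptions)
>     {
>         boolean legacy = "true".equals(keyspaceOptions.get("use_legacy_replica_placement"));
>         return legacy ? legacyNtsPlacement(token) : rackAwarePlacement(token);
>     }
> 
>     public static void main(String[] args)
>     {
>         System.out.println(placeReplicas(42L, Map.of("use_legacy_replica_placement", "true")));
>         System.out.println(placeReplicas(42L, Map.of())); // new keyspaces get the fixed placement
>     }
> }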
> 
> Alternatively we could have RackAwareTopologyStrategy and fail NTS keyspace 
> creation for RF > #racks, directing users to RackAwareTopologyStrategy to 
> maintain the quorum guarantee on rack loss, or to set an override flag 
> "support_quorum_on_rack_loss=false". This feels a bit iffy though since it 
> could potentially confuse users about when to use each strategy.
> 
> On Mon, Mar 6, 2023 at 5:51 AM Miklosovic, Stefan 
> <stefan.mikloso...@netapp.com> wrote:
> Hi all,
> 
> some time ago we identified an issue with NetworkTopologyStrategy. The 
> problem is that when RF > number of racks, NTS may place replicas in such a 
> way that when a whole rack is lost, we lose QUORUM and the data is no longer 
> available if QUORUM CL is used.
> 
> To illustrate this problem, let's take this setup:
> 
> 9 nodes in 1 DC, 3 racks, 3 nodes per rack. RF = 5. Then, NTS could place 
> replicas like this: 3 replicas in rack1, 1 replica in rack2, 1 replica in 
> rack3. Hence, when rack1 is lost, we do not have QUORUM.
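> 
> To make that concrete, here is a small self-contained Java sketch of how I 
> understand the current placement behaves (a simplified model only, not the 
> actual NetworkTopologyStrategy code): one replica is taken per ring position 
> until RF is met, and a rack may be repeated only while the allowance of 
> RF minus the number of racks lasts. With an unlucky ring order, three of the 
> five replicas end up in rack1:
> 
> import java.util.*;
> 
> public class PlacementSketch
> {
>     record Node(String name, String rack) {}
> 
>     // Simplified model of the ring walk, not the real Cassandra implementation.
>     static List<Node> place(List<Node> ringOrder, int rf, int rackCount)
>     {
>         List<Node> replicas = new ArrayList<>();
>         Set<String> usedRacks = new HashSet<>();
>         int repeatsLeft = rf - rackCount; // rack repeats tolerated when rf > number of racks
> 
>         for (Node n : ringOrder)
>         {
>             if (replicas.size() == rf)
>                 break;
>             if (usedRacks.add(n.rack()))
>                 replicas.add(n);          // first replica placed in this rack
>             else if (repeatsLeft > 0)
>             {
>                 repeatsLeft--;            // rack already used, spend one repeat
>                 replicas.add(n);
>             }
>         }
>         return replicas;
>     }
> 
>     public static void main(String[] args)
>     {
>         // 9 nodes, 3 racks, 3 nodes per rack; ring order happens to start with three rack1 nodes.
>         List<Node> ring = List.of(
>             new Node("n1", "rack1"), new Node("n2", "rack1"), new Node("n3", "rack1"),
>             new Node("n4", "rack2"), new Node("n5", "rack3"), new Node("n6", "rack2"),
>             new Node("n7", "rack3"), new Node("n8", "rack2"), new Node("n9", "rack3"));
> 
>         // Prints n1, n2, n3, n4, n5 -> three of the five replicas sit in rack1,
>         // so losing rack1 loses QUORUM for this token range.
>         System.out.println(place(ring, 5, 3));
>     }
> }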
> 
> It seems to us that there is already some logic around this scenario (1), but 
> the implementation is not entirely correct: it does not compute the replica 
> placement in a way that would address the above problem.
> 
> We created a draft here (2, 3) which fixes it.
> 
> There is also a test which simulates this scenario. I assign 256 tokens to 
> each node randomly (by the same means the generatetokens command uses), 
> compute the natural replicas for 1 billion random tokens, and count the cases 
> where 3 replicas out of 5 end up in the same rack (so that losing that rack 
> would lose quorum). For the above setup I get around 6%.
> 
> For 12 nodes, 3 racks, 4 nodes per rack, rf = 5, this happens in 10% cases.
> 
> To interpret this number: with such a topology, RF and CL, when a random rack 
> fails completely, a random QUORUM read has about a 6% (or 10%, respectively) 
> chance that the data will not be available.
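> 
> For a rough feel for these numbers without running the real test, here is a 
> standalone Monte Carlo sketch along the same lines (my own simplified 
> approximation of the placement walk, not the test from the draft, so the 
> exact percentage may differ):
> 
> import java.util.*;
> 
> public class QuorumLossEstimate
> {
>     public static void main(String[] args)
>     {
>         // 9 nodes in 3 racks, 256 random tokens per node, RF 5, QUORUM = 3.
>         int racks = 3, nodesPerRack = 3, vnodes = 256, rf = 5, samples = 100_000;
>         int nodeCount = racks * nodesPerRack;
>         Random rnd = new Random();
> 
>         // Random token ownership: token -> node, with node i placed in rack i % racks.
>         TreeMap<Long, Integer> ring = new TreeMap<>();
>         for (int node = 0; node < nodeCount; node++)
>             for (int v = 0; v < vnodes; v++)
>                 ring.put(rnd.nextLong(), node);
> 
>         int quorumLoss = 0;
>         for (int s = 0; s < samples; s++)
>         {
>             long t = rnd.nextLong();
> 
>             // Walk the ring clockwise from t (tokens >= t first, then wrap around).
>             List<Integer> order = new ArrayList<>(ring.tailMap(t).values());
>             order.addAll(ring.headMap(t).values());
> 
>             // Same simplified placement as in the sketch above: unseen nodes are
>             // taken, a rack may repeat only while rf - racks repeats remain.
>             List<Integer> replicas = new ArrayList<>();
>             Set<Integer> seenNodes = new HashSet<>();
>             Set<Integer> usedRacks = new HashSet<>();
>             int repeatsLeft = rf - racks;
> 
>             for (int node : order)
>             {
>                 if (replicas.size() == rf)
>                     break;
>                 if (!seenNodes.add(node))
>                     continue;                      // another vnode of a node already placed
>                 if (usedRacks.add(node % racks))
>                     replicas.add(node);
>                 else if (repeatsLeft > 0)
>                 {
>                     repeatsLeft--;
>                     replicas.add(node);
>                 }
>             }
> 
>             // If any single rack holds 3 of the 5 replicas, losing it loses QUORUM.
>             int[] perRack = new int[racks];
>             for (int node : replicas)
>                 if (++perRack[node % racks] >= 3)
>                 {
>                     quorumLoss++;
>                     break;
>                 }
>         }
> 
>         System.out.printf("QUORUM lost on a single rack failure for ~%.1f%% of sampled tokens%n",
>                           100.0 * quorumLoss / samples);
>     }
> }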
> 
> One caveat here is that NTS is no longer compatible with this new strategy 
> because it places replicas differently. So I guess that fixing this directly 
> in NTS will not be possible because of upgrades. I think people would need to 
> set up a completely new keyspace and somehow migrate the data if they wish, 
> or just start from scratch with this strategy.
> 
> Questions:
> 
> 1) do you think this is meaningful to fix, and that it might end up in trunk?
> 
> 2) should we not just ban this scenario entirely? It might be possible to 
> check the configuration upon keyspace creation (rf > number of racks) and, if 
> we see it is problematic, just fail that query. A guardrail, maybe? (A rough 
> sketch of such a check follows after these questions.)
> 
> 3) people in the ticket mention writing a "CEP" for this, but I do not see 
> any reason to do so. It is just a strategy like any other. What would that 
> CEP even be about? Is it necessary?
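> 
> Regarding 2), the check itself would be trivial; something along these lines 
> (a standalone sketch with made-up names, not the actual guardrail plumbing) 
> run on CREATE/ALTER KEYSPACE would cover it:
> 
> public class RackCountCheck
> {
>     // Hypothetical validation run at CREATE/ALTER KEYSPACE time; the names
>     // are invented here, this is not the actual guardrail API.
>     static void validate(String datacenter, int rf, int rackCount, boolean failInsteadOfWarn)
>     {
>         if (rf <= rackCount)
>             return; // each replica can live in its own rack, so quorum survives a rack loss
> 
>         String msg = String.format(
>             "rf %d in datacenter %s is higher than the number of racks (%d); " +
>             "a single rack failure may break QUORUM for some token ranges",
>             rf, datacenter, rackCount);
> 
>         if (failInsteadOfWarn)
>             throw new IllegalArgumentException(msg); // ban the scenario entirely
>         else
>             System.err.println("WARNING: " + msg);   // or just emit a client warning
>     }
> 
>     public static void main(String[] args)
>     {
>         validate("dc1", 5, 3, false); // warns: rf 5 > 3 racks
>         validate("dc1", 3, 3, true);  // fine: rf <= number of racks
>     }
> }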
> 
> Regards
> 
> (1) 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java#L126-L128
> (2) https://github.com/apache/cassandra/pull/2191
> (3) https://issues.apache.org/jira/browse/CASSANDRA-16203
