[ 
https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634931#comment-14634931
 ] 

Benedict commented on CASSANDRA-6477:
-------------------------------------

So, I've been mulling on this, and I think we can safely guarantee eventual 
consistency even without a write-path repair. The question is only timeliness; 
however, this can be problematic with or without write-path repair, since 
DELETEs and UPDATEs for the same operation are not replicated in a coordinated 
fashion.

h3. Timeliness / Consistency Caveats
If, for instance, three different updates are sent at once, and each 
base-replica receives a different update, each update may be propagated onto 
two further MV replicas via repair, giving 9 nodes with live data. Eventually 
one more base-replica receives each of the original updates, and two of them 
issue deletes for their now-shadowed data. So we have 7 nodes with live data, 
2 with tombstones. The batchlog is now happy, however the system is in an 
inconsistent state until either the MV replicas are repaired again or - with 
write-path repair - the base-replicas are. 
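
To make that counting concrete, here's a throwaway sketch (plain Java; the 
node names, the values and the base-to-MV pairing are entirely made up) that 
just tallies the scenario above:
{code:java}
import java.util.*;

// Walks through the bookkeeping of the scenario above: three competing values a/b/c, each seen
// by one base-replica, each written to one MV replica and then to two more via repair, and
// finally two deletes issued once two of the base-replicas learn their value lost.
// Everything here is invented purely for illustration.
public class TimelinessCounting {
    public static void main(String[] args) {
        Map<String, Set<String>> live = new HashMap<>();  // value -> MV nodes holding it live
        int tombstones = 0;
        int node = 0;
        for (String value : List.of("a", "b", "c")) {
            Set<String> holders = new HashSet<>();
            holders.add("mv" + (++node));                 // written by the base-replica that saw it
            holders.add("mv" + (++node));                 // plus two more MV replicas via repair
            holders.add("mv" + (++node));
            live.put(value, holders);
        }
        System.out.println("live after repair: " + total(live));                    // 9

        // Say "c" wins by timestamp: the two base-replicas that now see their value is shadowed
        // each issue a delete, which lands on their paired MV replica only.
        live.get("a").remove("mv1"); tombstones++;
        live.get("b").remove("mv4"); tombstones++;
        System.out.println("live: " + total(live) + ", tombstones: " + tombstones); // 7, 2
    }

    private static int total(Map<String, Set<String>> live) {
        return live.values().stream().mapToInt(Set::size).sum();
    }
}
{code}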

Without write-path repair we may be more prone to this problem, but I don't 
think dramatically more, although I haven't the time to think it through 
exhaustively. AFAICT, repair must necessarily be able to introduce 
inconsistency that can only be resolved by another repair (which itself can, of 
course, introduce more inconsistency).

I'm pretty sure there is a multiplicity of similar scenarios, and certainly 
there are less extreme ones. Two competing updates and one repair are enough, 
so long as it is the "wrong" update that gets repaired, and to the "wrong" MV 
replica.

h3. Correctness Caveats
There are also other problems unique to MV: if we lose any two base-replicas 
(which with vnodes means any two nodes), we can be left with ghost records that 
are *never* purged. So any concurrent loss of two nodes means we really need to 
truncate the MV and rebuild, or we need a special truncation record to truncate 
only the portion that was owned by those two nodes. This happens for any 
updates that were received only by those two nodes, but were then proxied on 
(or written to their batchlogs). Losing two nodes can of course also affect 
normal QUORUM updates to the cluster; the difference is that the user has no 
control over these MV writes, and simply resending the update to the cluster 
does not resolve the problem as it would in the current world. Users 
performing updates to single partitions that would never have been affected by 
this now also have this to worry about.

h3. Confidence
Certainly, I think we need a *bit* of formal analysis or simulation of what 
the possible cluster states are. Ideally a simple model of how each piece of 
infrastructure works could be constructed in a single process to run the 
equivalent of years of operations a normal cluster would execute, to explore 
the levels of "badness" we can expect. That's just my opinion, but I think it 
would be invaluable, because after spending some spare time thinking about 
these problems, I think it is a _very hard thing to do_, and I would rather not 
trust our feelings about correctness.
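
Something along these lines, say - a throwaway, single-process toy where 
everything (the replica layout, the loss probabilities, the "union the live 
rows" repair) is invented purely to show the shape of the harness; a real 
model would obviously also need batchlogs, hints, node death, multiple 
columns, and so on:
{code:java}
import java.util.*;

// A toy, single-process model: one base partition with RF=3, a paired MV replica per
// base-replica, randomized write loss, and a crude "union the live rows" view repair.
// None of this is Cassandra code; it only illustrates the kind of harness meant above.
public class MvSimulationSketch {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        int rf = 3;
        long[] baseTs = new long[rf];
        String[] baseVal = new String[rf];
        List<Set<String>> view = new ArrayList<>();
        for (int i = 0; i < rf; i++) view.add(new HashSet<>());

        for (long op = 1; op <= 100_000; op++) {
            String val = "v" + op;
            for (int b = 0; b < rf; b++) {
                if (rnd.nextDouble() < 0.2) continue;                               // replica misses the write
                String old = baseVal[b];
                baseTs[b] = op;
                baseVal[b] = val;
                if (old != null && rnd.nextDouble() > 0.1) view.get(b).remove(old); // MV delete, may be lost
                if (rnd.nextDouble() > 0.1) view.get(b).add(val);                   // MV insert, may be lost
            }
            if (op % 1_000 == 0) {                                                  // crude anti-entropy on the view
                Set<String> union = new HashSet<>();
                view.forEach(union::addAll);
                view.forEach(v -> v.addAll(union));
            }
        }

        // After everything settles, how far is each view replica from "exactly the winning value"?
        int newest = 0;
        for (int b = 1; b < rf; b++) if (baseTs[b] > baseTs[newest]) newest = b;
        String winner = baseVal[newest];
        for (int r = 0; r < rf; r++) {
            long stale = view.get(r).stream().filter(v -> !v.equals(winner)).count();
            System.out.println("view replica " + r + ": " + view.get(r).size() + " rows, " + stale + " stale");
        }
    }
}
{code}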

h3. Multiple Columns
As far as multiple columns are concerned: I think we may need to go back to the 
drawing board there. It's actually really easy to demonstrate the cluster 
getting into broken states. Say you have three columns, A B C, and you send 
three competing updates a b c to their respective columns; previously all held 
the value _. If they arrive in different orders on each base-replica we can end 
up with 6 different MV states around the cluster. If any base replica dies, you 
don't know which of those 6 intermediate states were taken (and probably 
replicated) by its MV replicas. This problem grows exponentially as you add 
"competing" updates (which, given split brain, can compete over arbitrarily 
long intervals).
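
To spell out where the 6 comes from (names made up; this is just an 
enumeration, not view code): apply the three updates in each of the 3! orders 
to a row starting as (_,_,_) and collect the states the row passes through on 
the way to the converged final state.
{code:java}
import java.util.*;

// Enumerates the intermediate row states for the A/B/C example: three competing single-column
// updates a, b, c applied in every possible order to a row that starts as (_,_,_). The final
// state always converges, but the intermediate states - which are what get turned into view
// rows and possibly replicated - take 6 distinct values.
public class MultiColumnStates {
    public static void main(String[] args) {
        String[][] orders = {
            {"a","b","c"}, {"a","c","b"}, {"b","a","c"},
            {"b","c","a"}, {"c","a","b"}, {"c","b","a"}
        };
        Set<String> intermediate = new TreeSet<>();
        for (String[] order : orders) {
            Map<Character, String> row = new HashMap<>(Map.of('A', "_", 'B', "_", 'C', "_"));
            for (int i = 0; i < order.length - 1; i++) {       // apply all but the last update
                row.put(Character.toUpperCase(order[i].charAt(0)), order[i]);
                intermediate.add(row.get('A') + "," + row.get('B') + "," + row.get('C'));
            }
        }
        // prints the 6 states: _,_,c  _,b,_  _,b,c  a,_,_  a,_,c  a,b,_
        System.out.println(intermediate.size() + " intermediate states: " + intermediate);
    }
}
{code}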

This is where my concern about a "single (base) node" dependency comes in, but 
after consideration it's clear that with a single column this problem is 
avoided because it's never ambiguous what the _old_ state was. If you encounter 
a mutation that is shadowed by your current data, you can always issue a delete 
for the correct prior state. With multiple columns that is no longer possible.
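
Schematically (purely an illustration of the single-column argument, not the 
actual write path): whichever of the two versions loses by timestamp, the base 
replica can always name the exact view row to retire.
{code:java}
// Illustrative only: with a single denormalized column, the losing version is always
// identifiable, so the right view delete can always be produced locally.
public class SingleColumnShadowing {
    static String onMutation(long curTs, String curVal, long inTs, String inVal) {
        if (inTs <= curTs)
            // Incoming is shadowed by what we hold, but it may have created a view row via some
            // other replica, so delete *its* row; ours stays.
            return "delete view row for '" + inVal + "'";
        // Incoming wins: retire the row for our old value, create the row for the new one.
        return "delete view row for '" + curVal + "', insert view row for '" + inVal + "'";
    }

    public static void main(String[] args) {
        System.out.println(onMutation(10, "x", 5, "y"));   // shadowed: delete view row for 'y'
        System.out.println(onMutation(10, "x", 20, "y"));  // wins: delete 'x' row, insert 'y' row
    }
}
{code}
With multiple columns there is no single "prior state" to retire: the view row 
the shadowed mutation may have produced elsewhere depends on which of the 
intermediate states above that replica happened to hold.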

I'm pretty sure the presence of multiple columns introduces other issues with 
each of the other moving parts.

h3. Important Implementation Detail
Had a quick browse, and I found that writeOrder is now wrapping multiple 
synchronous network operations. OpOrders are only intended to wrap local 
operations, so this should be rejigged to avoid locking up the cluster.
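
To illustrate what I mean (schematic only, not the actual patch code; both 
methods below are invented stand-ins, and I'm assuming the usual 
try-with-resources Group usage):
{code:java}
import org.apache.cassandra.utils.concurrent.OpOrder;

// Schematic only: "applyLocally" and "sendViewWriteAndWait" stand in for whatever the write
// path actually does; the point is only where the group is closed relative to the blocking
// remote call.
public class WriteOrderShape {
    private static final OpOrder writeOrder = new OpOrder();

    // Problematic: the OpOrder group stays open across a synchronous network round-trip, so
    // anything waiting on a barrier against writeOrder can be held up by a remote node.
    static void blockingShape() {
        try (OpOrder.Group op = writeOrder.start()) {
            applyLocally();
            sendViewWriteAndWait();   // synchronous network operation inside the group
        }
    }

    // Intended use: the group guards only the local work; the network operation happens outside
    // it (or asynchronously, with its own completion handling).
    static void localOnlyShape() {
        try (OpOrder.Group op = writeOrder.start()) {
            applyLocally();
        }
        sendViewWriteAndWait();
    }

    static void applyLocally() { /* local mutation application */ }
    static void sendViewWriteAndWait() { /* stand-in for a blocking remote write */ }
}
{code}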

h3. TL;DR
Anyway, hopefully that wall of text isn't too unappetizing, and is somewhat 
helpful. To summarize my thoughts, I think the following things are worthy of 
due consideration and potentially highlighting to the user:

* Availability:
** RF-1 node failures cause parts of the MV to receive NO updates, and remain 
incapable of responding correctly to any queries on their portion (this will 
apply unevenly for a given base update, meaning multiple generations of value 
could persist concurrently in the cluster)
** Any node loss results in a significantly larger hit to the consistency of 
the MV (in my example, 20% loss of QUORUM to cluster resulted in 57% loss to MV)
** Both of these are potentially avoidable by ensuring we try another node if 
"ours" is down, but due consideration needs to be given to whether this 
results in more cluster inconsistencies
* Repair seems necessarily able to introduce inconsistent cluster states that 
can only be resolved by another repair (which may introduce more such states 
at the same time), resulting in potentially lengthy inconsistencies, or a 
repair frequency greater than can operationally be managed right now
* Loss of any two nodes in a vnode cluster can result in permanent inconsistency
* Have we spotted all of the caveats?
* Rejig writeOrder usage
* Multiple columns need a lot more thought



> Materialized Views (was: Global Indexes)
> ----------------------------------------
>
>                 Key: CASSANDRA-6477
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6477
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API, Core
>            Reporter: Jonathan Ellis
>            Assignee: Carl Yeksigian
>              Labels: cql
>             Fix For: 3.0 beta 1
>
>         Attachments: test-view-data.sh, users.yaml
>
>
> Local indexes are suitable for low-cardinality data, where spreading the 
> index across the cluster is a Good Thing.  However, for high-cardinality 
> data, local indexes require querying most nodes in the cluster even if only a 
> handful of rows is returned.


