[
https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634931#comment-14634931
]
Benedict edited comment on CASSANDRA-6477 at 7/21/15 10:49 AM:
---------------------------------------------------------------
So, I've been mulling on this, and I think we can safely guarantee eventual
consistency even without a write-path repair. The question is only one of
timeliness; however, timeliness can be problematic with or without write-path
repair, since
DELETEs and UPDATEs for the same operation are not replicated in a coordinated
fashion.
h3. Timeliness / Consistency Caveats
If, for instance, three different updates are sent at once, and each
base-replica receives a different update, each may be propagated onto two
different nodes via repair, giving 9 nodes with live data. Eventually one more
base-replica receives each of the original updates and issues a delete for its
now-stale data. So we have 6 nodes with live data and 3 with tombstones. The
batchlog is now happy; however, the system is in an inconsistent state until
either the MV replicas are repaired again or - with write-path repair - the
base-replicas are.
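To make the arithmetic explicit, here's a toy tally of that scenario - purely illustrative, it just restates the numbers above, assuming RF=3 for both base table and MV:
{code:java}
// Toy tally of the scenario above; nothing here is project code.
public class ScenarioTally
{
    public static void main(String[] args)
    {
        int competingUpdates = 3;  // one update initially reaching each base-replica
        int viewCopiesPerRow = 3;  // repair spreads each stale MV row to all of its replicas

        int liveCopies = competingUpdates * viewCopiesPerRow;  // 9 nodes with live data

        // later, one more base-replica per update sees it is shadowed and issues a
        // delete, tombstoning the copy held by its paired MV replica
        int tombstonedCopies = competingUpdates;               // 3 with tombstones
        liveCopies -= tombstonedCopies;                        // 6 still live

        System.out.println("live=" + liveCopies + ", tombstoned=" + tombstonedCopies);
    }
}
{code}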
Without write-path repair we may be more prone to this problem, but I don't
think dramatically more, although I haven't the time to think it through
exhaustively. AFAICT, repair must necessarily be able to introduce
inconsistency that can only be resolved by another repair (which itself can, of
course, introduce more inconsistency).
I'm pretty sure there is a multiplicity of similar scenarios, and certainly
there are less extreme ones. Two competing updates and one repair are enough, so
long as it's the "wrong" update that gets repaired, and it gets repaired to the
"wrong" MV replica.
h3. Correctness Caveats
There are also other problems unique to MV: if we lose any two base-replicas
(which with vnodes means any two nodes), we can be left with ghost records that
are *never* purged. So any concurrent loss of two nodes means we really need to
truncate the MV and rebuild, or we need a special truncation record to truncate
only the portion that was owned by those two nodes. This happens for any
updates that were received only by those two nodes, but were then proxied on
(or written to their batchlogs). Losing two nodes can of course affect normal
QUORUM updates to the cluster as well; the difference is that the user has no
control over these view updates, and simply resending the update to the cluster
does not resolve the problem as it would in the current world. Users performing
updates to single partitions that would never have been affected by this now
also have it to worry about.
h3. Confidence
Certainly, I think we need to do a *bit* of formal analysis or simulation of what
the possible cluster states are. Ideally a simple model of how each piece of
infrastructure works could be constructed in a single process to run the
equivalent of years of operations a normal cluster would execute, to explore
the levels of "badness" we can expect. That's just my opinion, but I think it
would be invaluable, because after spending some spare time thinking about
these problems, I think it is a _very hard thing to do_, and I would rather not
trust our feelings about correctness.
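To make that concrete, here's the rough shape of harness I have in mind - a pure sketch with made-up names (ClusterModel, deliver, Cell) and a crude last-write-wins rule standing in for real reconciliation - which delivers the same events to each replica in an independent random order and asserts convergence once everything has arrived:
{code:java}
import java.util.*;
import java.util.function.Consumer;

// Illustrative skeleton only - not Cassandra code. Each replica is modelled as a
// last-write-wins map; updates and deletes are events; every run shuffles the
// delivery order per replica and then checks that the replicas converged.
// Extending this with derived view rows, partial deliveries, repairs and node
// loss is the kind of exploration I'm suggesting.
public class ClusterModel
{
    static final class Cell
    {
        final String value;   // null means tombstone
        final long timestamp;
        Cell(String value, long timestamp) { this.value = value; this.timestamp = timestamp; }
    }

    // crude reconciliation: higher timestamp wins, tombstones win ties
    static void deliver(Map<String, Cell> replica, String key, Cell cell)
    {
        Cell existing = replica.get(key);
        boolean supersedes = existing == null
                             || existing.timestamp < cell.timestamp
                             || (existing.timestamp == cell.timestamp && cell.value == null);
        if (supersedes)
            replica.put(key, cell);
    }

    public static void main(String[] args)
    {
        Random random = new Random();
        for (int run = 0; run < 10_000; run++)
        {
            List<Map<String, Cell>> viewReplicas = new ArrayList<>();
            for (int i = 0; i < 3; i++)
                viewReplicas.add(new HashMap<>());

            // three competing writes plus the deletes shadowing the two losers
            List<Consumer<Map<String, Cell>>> events = new ArrayList<>();
            events.add(r -> deliver(r, "row", new Cell("a", 1)));
            events.add(r -> deliver(r, "row", new Cell("b", 2)));
            events.add(r -> deliver(r, "row", new Cell("c", 3)));
            events.add(r -> deliver(r, "row", new Cell(null, 1)));  // shadows "a"
            events.add(r -> deliver(r, "row", new Cell(null, 2)));  // shadows "b"

            for (Map<String, Cell> replica : viewReplicas)
            {
                List<Consumer<Map<String, Cell>>> order = new ArrayList<>(events);
                Collections.shuffle(order, random);
                order.forEach(e -> e.accept(replica));
            }

            // once everything has been delivered, all replicas should agree on "c"
            for (Map<String, Cell> replica : viewReplicas)
                if (!"c".equals(replica.get("row").value))
                    throw new AssertionError("replicas diverged in run " + run);
        }
        System.out.println("all runs converged");
    }
}
{code}
The interesting numbers would come from stopping delivery part-way - to model node loss and repairs that haven't run - and tallying how often, and for how long, the replicas disagree.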
h3. Multiple Columns
As far as multiple columns are concerned: I think we may need to go back to the
drawing board there. It's actually really easy to demonstrate the cluster
getting into broken states. Say you have three columns A, B and C, and you send
three competing updates a, b and c to their respective columns; previously all held
the value _. If they arrive in different orders on each base-replica we can end
up with 6 different MV states around the cluster. If any base replica dies, you
don't know which of those 6 intermediate states were taken (and probably
replicated) by its MV replicas. This problem grows exponentially as you add
"competing" updates (which, given split brain, can compete over arbitrarily
long intervals).
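To check that count, here's a quick enumeration (illustrative only) over the 3! arrival orders, collecting the partial row states each order passes through - these are exactly the states a base-replica may have pushed into the view before dying:
{code:java}
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Enumerates the intermediate states of the A/B/C example: apply the first two
// updates of each of the six arrival orders and collect the distinct partial rows.
public class IntermediateStates
{
    public static void main(String[] args)
    {
        int[][] orders = { {0,1,2}, {0,2,1}, {1,0,2}, {1,2,0}, {2,0,1}, {2,1,0} };
        String[] updates = { "a", "b", "c" };
        Set<String> intermediate = new TreeSet<>();

        for (int[] order : orders)
        {
            String[] row = { "_", "_", "_" };   // initial values of columns A, B, C
            for (int step = 0; step < 2; step++)
            {
                row[order[step]] = updates[order[step]];
                intermediate.add(Arrays.toString(row));
            }
        }
        // prints the 6 distinct partial states reachable before the final (a, b, c)
        System.out.println(intermediate.size() + " intermediate states: " + intermediate);
    }
}
{code}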
This is where my concern about a "single (base) node" dependency comes in, but
after consideration it's clear that with a single column this problem is
avoided because it's never ambiguous what the _old_ state was. If you encounter
a mutation that is shadowed by your current data, you can always issue a delete
for the correct prior state. With multiple columns that is no longer possible.
I'm pretty sure the presence of multiple columns introduces other issues with
each of the other moving parts.
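To spell out the single-column argument, here's a schematic - made-up types and helper names, nothing from the branch: whichever of the two values loses the timestamp comparison is, on its own, the complete old view key, so the stale view row can always be identified and tombstoned.
{code:java}
import java.util.ArrayList;
import java.util.List;

// Schematic only, with made-up types: with a single view column, the value that
// loses the timestamp comparison is by itself the complete old view key.
public class SingleColumnShadowing
{
    static final class Versioned
    {
        final String value;
        final long timestamp;
        Versioned(String value, long timestamp) { this.value = value; this.timestamp = timestamp; }
    }

    static List<String> onBaseMutation(Versioned stored, Versioned incoming)
    {
        List<String> viewOps = new ArrayList<>();
        if (incoming.timestamp <= stored.timestamp)
        {
            // shadowed update: the incoming value alone identifies the stale view row
            viewOps.add("DELETE view row for value '" + incoming.value + "'");
        }
        else
        {
            // newer update: retire the view row for the old value, create the new one
            viewOps.add("DELETE view row for value '" + stored.value + "'");
            viewOps.add("INSERT view row for value '" + incoming.value + "'");
        }
        // with multiple columns, a shadowed *partial* update no longer pins down the
        // complete old row, so the equivalent delete cannot always be constructed
        return viewOps;
    }

    public static void main(String[] args)
    {
        // a late update carrying 'old' arrives after 'new' has already been applied
        System.out.println(onBaseMutation(new Versioned("new", 2), new Versioned("old", 1)));
    }
}
{code}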
h3. Important Implementation Detail
Had a quick browse, and I found that writeOrder is now wrapping multiple
synchronous network operations. OpOrders are only intended to wrap local
operations, so this should be rejigged to avoid locking up the cluster.
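To illustrate the rejig I mean - schematic only, with a plain read-write lock standing in for OpOrder so as not to misquote its API - the point is simply to shrink the guarded region so it no longer spans a blocking network round-trip:
{code:java}
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Schematic only: a read-write lock stands in for the real OpOrder, because the
// point is the *scope* of the guarded region, not the exact API. Today the guard
// spans a synchronous network round-trip; it should cover only the local work.
public class GuardScopeSketch
{
    final ReadWriteLock writeOrderStandIn = new ReentrantReadWriteLock();

    void currentShape()
    {
        writeOrderStandIn.readLock().lock();
        try
        {
            applyLocally();
            waitForViewReplicas();  // blocking remote call while the guard is held:
                                    // anything waiting on the barrier stalls behind it
        }
        finally
        {
            writeOrderStandIn.readLock().unlock();
        }
    }

    void proposedShape()
    {
        writeOrderStandIn.readLock().lock();
        try
        {
            applyLocally();         // only the short-lived local work is guarded
        }
        finally
        {
            writeOrderStandIn.readLock().unlock();
        }
        waitForViewReplicas();      // the remote wait happens outside the guard
    }

    void applyLocally()        { /* stand-in for the local memtable write */ }
    void waitForViewReplicas() { /* stand-in for the synchronous network round-trip */ }

    public static void main(String[] args)
    {
        new GuardScopeSketch().proposedShape();
    }
}
{code}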
h3. TL;DR
Anyway, hopefully that wall of text isn't too unappetizing, and is somewhat
helpful. To summarize my thoughts, I think the following things are worthy of
due consideration and potentially highlighting to the user:
* Availability:
** RF-1 node failures cause parts of the MV to receive NO updates, and remain
incapable of responding correctly to any queries on their portion (this will
apply unevenly for a given base update, meaning multiple generations of value
could persist concurrently in the cluster)
** Any node loss results in a significantly larger hit to the consistency of
the MV (in my example, 20% loss of QUORUM to cluster resulted in 57% loss to MV)
** Both of these are potentially avoidable by ensuring we try another node if
"ours" is down, but due consideration needs to be given to whether this
potentially results in more cluster inconsistencies
* Repair seems necessarily to carry the possibility of introducing inconsistent
cluster states that can only be resolved by another repair (which introduces
more such states at the same time), resulting in potentially lengthy
inconsistencies, or a repair frequency greater than can operationally be managed
right now
* Loss of any two nodes in a vnode cluster can result in permanent inconsistency
* Have we spotted all of the caveats?
* Rejig writeOrder usage
* Multiple columns need a lot more thought
> Materialized Views (was: Global Indexes)
> ----------------------------------------
>
> Key: CASSANDRA-6477
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6477
> Project: Cassandra
> Issue Type: New Feature
> Components: API, Core
> Reporter: Jonathan Ellis
> Assignee: Carl Yeksigian
> Labels: cql
> Fix For: 3.0 beta 1
>
> Attachments: test-view-data.sh, users.yaml
>
>
> Local indexes are suitable for low-cardinality data, where spreading the
> index across the cluster is a Good Thing. However, for high-cardinality
> data, local indexes require querying most nodes in the cluster even if only a
> handful of rows is returned.