On Thu, Feb 22, 2018 at 12:54 AM, Paulo Motta <pauloricard...@gmail.com> wrote:
> > 1. It seems that, for example, when RF=3, each one of the three base
> > replicas will send a view update to the fourth "pending node". While this
> > is not wrong, it's also inefficient - why send three copies of the same
> > update? Wouldn't it be more efficient if just one of the base replicas -
> > the one which eventually will be paired with the pending node - sent
> > the updates to it? Is there a problem with such a scheme?
> This optimization can be done when there's a single pending range per
> view replica set, but when there are multiple pending ranges and there
> are failures, it's possible that the paired view replica changes, which
> can lead to missing updates. For instance, see the following scenario:
> - There are 2 pending ranges A' and B'.
Is this a realistic case when Cassandra (unless I'm missing something) is
limited to adding or removing a single node at a time?
How/when would we have two pending nodes for a single view partition?
I'm sure this can happen under some sort of generic range movement (how
does one initiate such a movement, and why?), but will it happen under
"normal" conditions of bootstrap or decommission of a single node?
> - Base replica A sends update to pending-paired view replica A'.
> - Base replica B is down, so pending-paired view replica B' does not get
>   the update.
> - Range movement A' fails and B' succeeds.
> - B' becomes A's new paired view replica.
> - A will be out of sync with B'.
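To make sure I follow the scenario, here is a toy Python model of it. This is only a sketch of my reading of the steps above: the function and variable names are made up for illustration and are not Cassandra APIs, and it models a single update "u1" with base replicas A and B and pending view replicas A' and B'.

```python
# Toy model of the failure scenario; all names are illustrative, not Cassandra APIs.

def replicate(update, pending, paired, base_up, send_to_all_pending):
    """Return {pending_replica: set_of_updates_received} after one view update."""
    received = {p: set() for p in pending}
    for base, is_up in base_up.items():
        if not is_up:
            continue  # a base replica that is down sends nothing
        # Either every pending replica gets a copy, or only the paired one does.
        targets = pending if send_to_all_pending else [paired[base]]
        for target in targets:
            received[target].add(update)
    return received

pending = ["A'", "B'"]
paired = {"A": "A'", "B": "B'"}    # pending-paired view replicas
base_up = {"A": True, "B": False}  # base replica B is down

optimized = replicate("u1", pending, paired, base_up, send_to_all_pending=False)
current = replicate("u1", pending, paired, base_up, send_to_all_pending=True)

# Range movement A' fails and B' succeeds, so B' becomes A's paired view replica.
print("paired-only send, B' has u1:", "u1" in optimized["B'"])
print("send to all pending, B' has u1:", "u1" in current["B'"])
```

In this model, B' misses the update when each base replica writes only to its own pending-paired replica, and receives it when every base replica writes to all pending replicas.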
> Furthermore, we would need to cache the ring state after the range
> movement is completed to be able to compute the pending-paired view
> replica, but we don't have this info easily available currently, so it
> seems that it would not be a trivial change, but perhaps worth pursuing
> in the single pending range case.
Yes, it seems it will not be trivial. But if this is the common case in
operations such as node addition or removal, it may significantly reduce
(from RF*2 to RF+1) the number of view updates being sent around, and avoid
MV update performance degradation during the streaming process.
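As a back-of-the-envelope check of that count, here is a small Python sketch. It assumes the simplest case, where every base replica has one paired view replica and there is a single pending node; the function names are mine, chosen only for illustration.

```python
def view_updates_current(rf, pending=1):
    # Every base replica writes to its paired view replica
    # and also to each pending node.
    return rf * (1 + pending)

def view_updates_optimized(rf, pending=1):
    # Every base replica writes to its paired view replica; each pending
    # node gets a single copy, from the base replica it will be paired with.
    return rf + pending

for rf in (1, 3, 5):
    print(rf, view_updates_current(rf), view_updates_optimized(rf))
```

For RF=3 this gives 6 view updates per base write today versus 4 with the optimization.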
> > 2. There's an optimization that when we're lucky enough that the paired
> > view replica is the same as this base replica, mutateMV doesn't use the
> > normal view-mutation-sending code (wrapViewBatchResponseHandler) and just
> > writes the mutation locally. In particular, in this case we do NOT write to
> > the pending node (unless I'm missing something). But sometimes all
> > replicas will be paired with themselves - this can happen, for example, when
> > the number of nodes is equal to RF, or when the base and view table have the
> > same partition keys (but different clustering keys). In this case, it seems
> > the pending node will not be written to at all... Isn't this a bug?
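For the record, this is the behavior I had in mind, as a toy Python model. It is not the actual mutateMV code: view_write_targets and the node names are hypothetical, and the flag only switches between the shortcut as I understood it and a variant that still writes to pending nodes.

```python
def view_write_targets(local, paired, pending, skip_pending_when_local):
    """Return the endpoints that get the view mutation from one base replica."""
    if paired == local and skip_pending_when_local:
        return [local]  # local-write shortcut: pending nodes are skipped
    return [paired] + list(pending)

nodes = ["n1", "n2", "n3"]  # number of nodes equals RF: every replica pairs with itself
pending = ["n4"]            # the bootstrapping node

def pending_node_written(skip_pending_when_local):
    targets = set()
    for node in nodes:
        targets.update(
            view_write_targets(node, node, pending, skip_pending_when_local))
    return "n4" in targets

print("with shortcut, pending node written:", pending_node_written(True))
print("without shortcut, pending node written:", pending_node_written(False))
```

With the shortcut applied on every replica, no base replica ever sends the view mutation to n4 in this model.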
> Good catch! This indeed seems to be a regression caused by
> CASSANDRA-13069, so I created CASSANDRA-14251 to restore the correct