Hi, I was trying to understand how view tables are updated during a period
of range movements, such as bootstrapping a new node or decommissioning an
existing one. In particular, while data is being streamed, there can be a
new replica on a "pending node" to which we also need to send the view
update.

I looked at the mutateMV() code, and I think I spotted two issues with it.
I wonder whether I'm missing something or whether these are real problems:

1. It seems that, for example, when RF=3, each of the three base replicas
will send a view update to the fourth, "pending" node. While this is not
wrong, it is inefficient - why send three copies of the same update?
Wouldn't it be more efficient if just one of the base replicas - the one
that will eventually be paired with the pending node - sent the updates to
it? Is there a problem with such a scheme? (A rough sketch of what I mean
is below.)

2. There's an optimization whereby, when we're lucky enough that the paired
view replica is this base replica itself, mutateMV doesn't use the normal
view-mutation-sending code (wrapViewBatchResponseHandler) and just applies
the mutation locally. In particular, in this case we do NOT write to the
pending node (unless I'm missing something). But sometimes all replicas
will be paired with themselves - this can happen, for example, when the
number of nodes equals RF, or when the base table and the view have the
same partition key (but different clustering keys). In that case, it seems
the pending node will never be written to at all... Isn't this a bug? (See
the sketch below.)
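To make that scenario concrete, here is the same kind of toy model (again,
made-up names, not the real mutateMV logic) for the case where every base
replica is paired with itself:

    import java.util.List;

    // Toy model: 3 nodes, RF=3, and the view has the same partition key as
    // the base table, so every base replica is paired with itself. If the
    // local-apply optimization skips the pending endpoints, then no base
    // replica ever forwards the view update to the pending node P.
    public class SelfPairedPendingSketch
    {
        static final List<String> BASE_REPLICAS = List.of("N1", "N2", "N3");
        static final String PENDING = "P";

        public static void main(String[] args)
        {
            int copiesReceivedByPending = 0;
            for (String base : BASE_REPLICAS)
            {
                String pairedViewReplica = base; // paired with itself
                if (pairedViewReplica.equals(base))
                {
                    // The optimization in question: apply the view mutation
                    // locally and skip the batchlog/response-handler path -
                    // and with it, the write to the pending node.
                    System.out.println(base + ": applied view update locally only");
                }
                else
                {
                    // The normal path would also write to pending endpoints.
                    copiesReceivedByPending++;
                }
            }
            System.out.println("Copies received by pending node " + PENDING
                               + ": " + copiesReceivedByPending); // prints 0
        }
    }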

The strange thing about issue 2 is that this code used to be correct (at
least according to my understanding...) - it used to skip this
optimization if pendingEndpoints was not empty. But this was changed in
commit 12103653f31. Why?
https://issues.apache.org/jira/browse/CASSANDRA-13069 contains an
explanation of that change:
     "I also removed the pendingEndpoints.isEmpty() condition to skip the
batchlog for local mutations, since this was a pre-CASSANDRA-10674 leftover
when ViewUtils.getViewNaturalEndpoint returned the local address to force
non-paired replicas to be written to the batchlog." (Paulo Motta, 21/Dec/16)

But I don't understand this explanation... Being paired with yourself is
not only a "trick", but also something that really happens (by chance, or,
in the cases I showed above, always), and it needs to be handled correctly
even while the cluster is growing. If none of the base replicas sends the
view update to the pending node, it will end up missing this update...

Thanks,
Nadav.


--
Nadav Har'El
n...@scylladb.com
