Igniters,

As most of you know, Ignite has a concept of AffinityTopologyVersion,
which is associated with the set of nodes currently present in the
topology and with the global cluster state (active/inactive, baseline
topology, started caches). Modification of either of them involves a
process called Partition Map Exchange (PME) and results in a new
AffinityTopologyVersion. During PME all new cache and compute grid
operations are globally "frozen", which may lead to cache downtimes of
unpredictable duration.

However, our recent changes (especially the introduction of Baseline
Topology) made me re-think this concept. There are currently many cases
where we trigger PME even though it isn't necessary. For example,
adding or removing a client node, or a server node that is not in the
BLT, should never cause partition map modifications: such events modify
the *topology*, but *affinity* is unaffected. On the other hand, there
are events that affect only *affinity* - the most straightforward
example is the CacheAffinityChange event, which is triggered after
rebalance finishes to assign new primary/backup nodes. So the term
*AffinityTopologyVersion* now looks odd - it tries to "merge" two
entities that aren't always related. To me it makes sense to introduce
separate *AffinityVersion* and *TopologyVersion*, review all events
that currently modify AffinityTopologyVersion, and split them into 3
categories: those that modify only AffinityVersion, only
TopologyVersion, and both. This would allow us to process such events
using different mechanics, avoid redundant steps, and also reconsider
the mapping of operations - some would be mapped to topology, others to
affinity.
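
To illustrate the idea (hypothetical types only, not existing Ignite
API), the split could look roughly like this, with every event
classified by which counter it bumps:

// Hypothetical sketch only - these types do not exist in Ignite today.
// Each cluster event is classified by which version counter it bumps.
enum VersionChange {
    TOPOLOGY_ONLY,  // e.g. client node join/leave, non-baseline server join/leave
    AFFINITY_ONLY,  // e.g. CacheAffinityChange after rebalance finishes
    BOTH            // e.g. server node join within the baseline
}

class ClusterVersions {
    private long topVer; // incremented on pure topology changes
    private long affVer; // incremented on affinity changes

    void onEvent(VersionChange change) {
        if (change == VersionChange.TOPOLOGY_ONLY || change == VersionChange.BOTH)
            topVer++;
        if (change == VersionChange.AFFINITY_ONLY || change == VersionChange.BOTH)
            affVer++;
    }
}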

Here is my view on how the different event types could theoretically be
optimized (a rough dispatch sketch follows the list):
1. Client node start / stop: as stated above, no PME is needed, ticket
https://issues.apache.org/jira/browse/IGNITE-9558 is already in progress.
2. Server node start / stop not from baseline: should be similar to the
previous case, since nodes outside of the baseline cannot be partition
owners.
3. Server node start in baseline: both affinity and topology versions
should be incremented, but it might be possible to optimize PME for
this case and avoid a cluster-wide freeze. Partition assignments for
such a node are already calculated, so we can simply put them all into
MOVING state. However, it might take significant effort to avoid race
conditions and to redesign our architecture.
4. Cache start / stop: starting or stopping one cache doesn't modify
partition maps of other caches. It should be possible to change this
procedure to skip PME and perform all necessary actions (compute
affinity, start/stop cache contexts on each node) in the background,
but this also looks like a very complex modification.
5. Rebalance finish: it seems possible to design a "lightweight" PME
for this case as well. If there were no node failures (and if there
were, a PME should be triggered and the rebalance cancelled anyway),
all partition states are already known to the coordinator. Furthermore,
no new MOVING or OWNING node is introduced for any partition, so all
previous mappings should still be valid.
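
To make the classification above concrete, here is a rough sketch of
how each event type could be dispatched to no exchange, a lightweight
exchange, or a full PME. All names are illustrative, not existing
Ignite classes:

// Hypothetical sketch - illustrative names, not existing Ignite classes.
class ExchangeDispatcher {
    enum EventType {
        CLIENT_NODE_JOIN, CLIENT_NODE_LEAVE,
        SERVER_NODE_JOIN, SERVER_NODE_LEAVE,
        CACHE_START, CACHE_STOP,
        AFFINITY_CHANGE // rebalance finished
    }

    enum ExchangeKind { NONE, LIGHTWEIGHT, FULL }

    static ExchangeKind exchangeFor(EventType evt, boolean nodeInBaseline) {
        switch (evt) {
            case CLIENT_NODE_JOIN:
            case CLIENT_NODE_LEAVE:
                return ExchangeKind.NONE;            // case 1

            case SERVER_NODE_JOIN:
            case SERVER_NODE_LEAVE:
                if (!nodeInBaseline)
                    return ExchangeKind.NONE;        // case 2
                // case 3: a baseline node join could become lightweight;
                // a baseline node failure still needs a full PME.
                return evt == EventType.SERVER_NODE_JOIN
                    ? ExchangeKind.LIGHTWEIGHT
                    : ExchangeKind.FULL;

            case CACHE_START:
            case CACHE_STOP:
                return ExchangeKind.LIGHTWEIGHT;     // case 4

            case AFFINITY_CHANGE:
                return ExchangeKind.LIGHTWEIGHT;     // case 5

            default:
                return ExchangeKind.FULL;
        }
    }
}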

For the latter, more complex cases it might be necessary to introduce
an "is compatible" relationship between affinity versions. An operation
needs to be remapped only if the new version isn't compatible with the
previous one.
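
A minimal sketch of what this might look like, assuming a hypothetical
major/minor scheme where only major changes invalidate existing
mappings:

// Hypothetical sketch: compatibility between affinity versions.
class AffinityVersion {
    final long major; // bumped when primary/backup assignments actually change
    final long minor; // bumped on changes that keep existing assignments valid

    AffinityVersion(long major, long minor) {
        this.major = major;
        this.minor = minor;
    }

    // Previous operation mappings stay valid while the major part is unchanged.
    boolean isCompatibleWith(AffinityVersion mappedOn) {
        return this.major == mappedOn.major;
    }
}

// An in-flight operation mapped on 'mappedVer' is remapped only if the
// current version is not compatible with it:
//
//     if (!currentVer.isCompatibleWith(mappedVer))
//         remap(operation);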

Please share your thoughts.

-- 
Best regards,
Ilya
