Ilya,

This is a great idea, but before we can ultimately decouple the affinity
version from the topology version, we need to fix a few things with
baseline topology first. Currently the in-memory caches are not using the
baseline topology. We are going to fix this as a part of IEP-4 Phase II
(baseline auto-adjust). Once fixed, we can safely assume that an
out-of-baseline node does not affect affinity distribution.

Agree with Dmitriy that we should start with simpler optimizations first.

Thu, Sep 13, 2018 at 15:58, Ilya Lantukh <ilant...@gridgain.com>:

> Igniters,
>
> As most of you know, Ignite has a concept of AffinityTopologyVersion, which
> is associated with the set of nodes currently present in the topology and
> with the global cluster state (active/inactive, baseline topology, started
> caches). Modifying either of them involves a process called Partition Map
> Exchange (PME) and results in a new AffinityTopologyVersion. During PME all
> new cache and compute grid operations are globally "frozen", which might
> lead to unpredictable cache downtimes.
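>
> For reference, the version is essentially a (major, minor) pair, compared
> lexicographically; a simplified sketch of the idea (not the exact internal
> class) could look like this:
>
>     /** Simplified model of the version produced by each PME. */
>     final class AffinityTopologyVersion implements Comparable<AffinityTopologyVersion> {
>         /** Incremented on node join/leave. */
>         private final long topVer;
>
>         /** Incremented on "minor" changes within the same topology, e.g. cache start. */
>         private final int minorTopVer;
>
>         AffinityTopologyVersion(long topVer, int minorTopVer) {
>             this.topVer = topVer;
>             this.minorTopVer = minorTopVer;
>         }
>
>         @Override public int compareTo(AffinityTopologyVersion o) {
>             int cmp = Long.compare(topVer, o.topVer);
>             return cmp != 0 ? cmp : Integer.compare(minorTopVer, o.minorTopVer);
>         }
>     }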
>
> However, our recent changes (especially the introduction of Baseline
> Topology) caused me to re-think this concept. There are currently many
> cases where we trigger PME even though it isn't necessary. For example,
> adding or removing a client node, or a server node that is not in the BLT,
> should never cause partition map modifications. Such events modify the
> *topology*, but *affinity* is unaffected. On the other hand, there are
> events that affect only *affinity* - the most straightforward example is
> the CacheAffinityChange event, which is triggered after rebalance finishes
> to assign new primary/backup nodes.
> So the term *AffinityTopologyVersion* now looks weird - it tries to "merge"
> two entities that aren't always related. To me it makes sense to introduce
> separate *AffinityVersion* and *TopologyVersion*, review all events that
> currently modify AffinityTopologyVersion, and split them into 3 categories:
> those that modify only AffinityVersion, only TopologyVersion, or both. This
> would allow us to process such events using different mechanics, avoid
> redundant steps, and also reconsider the mapping of operations - some would
> be mapped to the topology version, others to the affinity version.
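>
> As a rough sketch, the classification could be expressed along these lines
> (the names are hypothetical, just to illustrate the split):
>
>     /** Hypothetical classification of events by which version they bump. */
>     enum VersionImpact {
>         /** E.g. client node join/leave, non-baseline server join/leave. */
>         TOPOLOGY_ONLY,
>
>         /** E.g. CacheAffinityChange fired after rebalance finishes. */
>         AFFINITY_ONLY,
>
>         /** E.g. baseline server node join/leave. */
>         BOTH
>     }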
>
> Here is my view about how different event types theoretically can be
> optimized:
> 1. Client node start / stop: as stated above, no PME is needed, ticket
> https://issues.apache.org/jira/browse/IGNITE-9558 is already in progress.
> 2. Server node start / stop not from baseline: should be similar to the
> previous case, since nodes outside of baseline cannot be partition owners
> (a sketch of this check follows the list).
> 3. Start node in baseline: both affinity and topology versions should be
> incremented, but it might be possible to optimize PME for this case and
> avoid a cluster-wide freeze. Partition assignments for such a node are
> already calculated, so we can simply put them all into MOVING state.
> However, it might take significant effort to avoid race conditions and
> redesign our architecture.
> 4. Cache start / stop: starting or stopping one cache doesn't modify
> partition maps of other caches. It should be possible to change this
> procedure to skip PME and perform all necessary actions (compute affinity,
> start/stop cache contexts on each node) in the background, but it looks
> like a very complex modification too.
> 5. Rebalance finish: it seems possible to design a "lightweight" PME for
> this case as well. If there were no node failures (and if there were, PME
> should be triggered and rebalance should be cancelled anyway), all
> partition states are already known to the coordinator. Furthermore, no new
> MOVING or OWNING node is introduced for any partition, so all previous
> mappings should still be valid.
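>
> To illustrate cases 1 and 2 above, here is a minimal sketch of the check
> that could decide whether a joining node requires a full PME (hypothetical
> helper, not actual Ignite code; ClusterNode is the public API interface):
>
>     import java.util.Set;
>
>     import org.apache.ignite.cluster.ClusterNode;
>
>     final class PmeRules {
>         /** Hypothetical check: does a joining node require a full PME? */
>         static boolean requiresFullPme(ClusterNode node, Set<Object> baselineIds) {
>             if (node.isClient())
>                 return false; // Case 1: client nodes never own partitions.
>
>             if (baselineIds != null && !baselineIds.contains(node.consistentId()))
>                 return false; // Case 2: a server outside BLT cannot own partitions.
>
>             return true; // Case 3: a baseline server affects partition ownership.
>         }
>     }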
>
> For the latter, more complex cases it might be necessary to introduce an
> "is compatible" relationship between affinity versions. An operation needs
> to be remapped only if the new version isn't compatible with the previous
> one.
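>
> A minimal sketch of such a check, assuming the AffinityVersion split
> described above (the interface and its knownOwners() method are
> hypothetical):
>
>     import java.util.Set;
>     import java.util.UUID;
>
>     /** Hypothetical affinity version exposing every node that may own a partition. */
>     interface AffinityVersion {
>         /** Ids of all nodes that are OWNING or MOVING for some partition. */
>         Set<UUID> knownOwners();
>     }
>
>     final class AffinityCompatibility {
>         /**
>          * Following case 5 above: if the new version introduces no owner
>          * that was unknown at the previous one, operations mapped at 'prev'
>          * still reach valid nodes and need no remap.
>          */
>         static boolean isCompatible(AffinityVersion prev, AffinityVersion next) {
>             return prev.knownOwners().containsAll(next.knownOwners());
>         }
>     }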
>
> Please share your thoughts.
>
> --
> Best regards,
> Ilya
>
