Ilya,

This is a great idea, but before we can ultimately decouple the affinity version from the topology version, we need to fix a few things in the baseline topology first. Currently, in-memory caches do not use the baseline topology. We are going to fix this as part of IEP-4 Phase II (baseline auto-adjust). Once that is fixed, we can safely assume that an out-of-baseline node does not affect the affinity distribution.
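To make that invariant concrete, here is a rough sketch of what the fix buys us (the helper below is hypothetical, not the actual implementation): once in-memory caches honor the baseline, the set of affinity nodes can be derived from the BLT alone, so joins and leaves of clients and out-of-baseline servers cannot change it.

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

import org.apache.ignite.Ignite;
import org.apache.ignite.cluster.BaselineNode;
import org.apache.ignite.cluster.ClusterNode;

public class BaselineAffinityNodes {
    /**
     * Returns the server nodes that may own partitions: only those whose
     * consistent IDs are present in the current baseline topology.
     */
    public static List<ClusterNode> affinityNodes(Ignite ignite) {
        Collection<BaselineNode> blt = ignite.cluster().currentBaselineTopology();

        // No baseline set (e.g. a pure in-memory cluster today) -
        // fall back to all server nodes.
        if (blt == null)
            return new ArrayList<>(ignite.cluster().forServers().nodes());

        Set<Object> bltIds = blt.stream()
            .map(BaselineNode::consistentId)
            .collect(Collectors.toSet());

        // Clients and out-of-baseline servers are filtered out, so their
        // join/leave events cannot change the result of this method.
        return ignite.cluster().forServers().nodes().stream()
            .filter(n -> bltIds.contains(n.consistentId()))
            .collect(Collectors.toList());
    }
}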
Agree with Dmitriy that we should start with simpler optimizations first.

Thu, Sep 13, 2018 at 15:58, Ilya Lantukh <ilant...@gridgain.com>:

> Igniters,
>
> As most of you know, Ignite has a concept of AffinityTopologyVersion,
> which is associated with the nodes that are currently present in the
> topology and with the global cluster state (active/inactive, baseline
> topology, started caches). Modification of either of them involves a
> process called Partition Map Exchange (PME) and results in a new
> AffinityTopologyVersion. At that moment all new cache and compute grid
> operations are globally "frozen". This might lead to indeterminate cache
> downtimes.
>
> However, our recent changes (esp. the introduction of Baseline Topology)
> caused me to re-think these concepts. Currently there are many cases when
> we trigger PME, but it isn't necessary. For example, adding or removing a
> client node, or a server node not in the BLT, should never cause partition
> map modifications. Those events modify the *topology*, but the *affinity*
> is unaffected. On the other hand, there are events that affect only the
> *affinity* - the most straightforward example is the CacheAffinityChange
> event, which is triggered after rebalance is finished to assign new
> primary/backup nodes. So the term *AffinityTopologyVersion* now looks
> weird - it tries to "merge" two entities that aren't always related. To me
> it makes sense to introduce separate *AffinityVersion* and
> *TopologyVersion*, review all events that currently modify
> AffinityTopologyVersion, and split them into 3 categories: those that
> modify only the AffinityVersion, those that modify only the
> TopologyVersion, and those that modify both. It will allow us to process
> such events using different mechanics and avoid redundant steps, and also
> to reconsider the mapping of operations - some will be mapped to the
> topology, others to the affinity.
>
> Here is my view on how the different event types could theoretically be
> optimized:
> 1. Client node start / stop: as stated above, no PME is needed; ticket
> https://issues.apache.org/jira/browse/IGNITE-9558 is already in progress.
> 2. Server node start / stop not from baseline: should be similar to the
> previous case, since nodes outside of the baseline cannot be partition
> owners.
> 3. Start of a node in the baseline: both affinity and topology versions
> should be incremented, but it might be possible to optimize PME for this
> case and avoid a cluster-wide freeze. Partition assignments for such a
> node are already calculated, so we can simply put them all into the
> MOVING state. However, it might take significant effort to avoid race
> conditions and redesign our architecture.
> 4. Cache start / stop: starting or stopping one cache doesn't modify the
> partition maps of other caches. It should be possible to change this
> procedure to skip PME and perform all the necessary actions (compute
> affinity, start/stop cache contexts on each node) in the background, but
> it looks like a very complex modification too.
> 5. Rebalance finish: it seems possible to design a "lightweight" PME for
> this case as well. If there were no node failures (and if there were, PME
> should be triggered and the rebalance should be cancelled anyway), all
> partition states are already known by the coordinator. Furthermore, no
> new MOVING or OWNING node is introduced for any partition, so all
> previous mappings should still be valid.
>
> For the latter, more complex cases it might be necessary to introduce an
> "is compatible" relationship between affinity versions. An operation
> needs to be remapped only if the new version isn't compatible with the
> previous one.
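Regarding the "is compatible" relationship - here is a minimal sketch of how I read the idea (the class and field names are mine, purely illustrative): a topology-only change bumps the topology counter but leaves the affinity counter alone, and an operation mapped to an affinity version stays valid as long as that counter matches.

/** Illustrative only - not a proposal for the actual class layout. */
public final class SplitVersion {
    /** Incremented on any cluster topology change (joins/leaves). */
    private final long topVer;

    /** Incremented only when partition assignments may change. */
    private final long affVer;

    public SplitVersion(long topVer, long affVer) {
        this.topVer = topVer;
        this.affVer = affVer;
    }

    /**
     * An operation mapped to {@code other} does not need a remap if the
     * affinity part is unchanged, even if the topology part moved on
     * (e.g. a client node joined in between).
     */
    public boolean isCompatible(SplitVersion other) {
        return affVer == other.affVer;
    }
}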
>
> Please share your thoughts.
>
> --
> Best regards,
> Ilya
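One more note on the three categories: when we get to the split, a classification along the following lines could drive which mechanics each event goes through (the enum below is just my sketch of the examples from this thread, not an exhaustive or final list).

/** Sketch: which version counters a cluster event would bump. */
public enum VersionImpact {
    /** E.g. client node start/stop, server start/stop outside the BLT. */
    TOPOLOGY_ONLY,

    /** E.g. CacheAffinityChange fired after rebalance finishes. */
    AFFINITY_ONLY,

    /** E.g. start/stop of a baseline server node. */
    BOTH
}

Events in the first category could then skip PME entirely once the in-memory baseline fix above lands.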