Maxim, I looked at the code you provided. I think we need to add some timeout validation on the demander side: if there is no supply message within 30 secs, output a warning to the logs and repeat the demanding process. This should apply to every demand message throughout the rebalancing process, not only the first one.
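A minimal sketch of the demander-side check described above. It assumes a simplified in-process model, not Ignite's real demander/supplier code: `DemandTimeoutSketch`, `demandWithRetry`, `rebalancePartition`, the `transport` callback and the `maxAttempts` cap are all hypothetical. The timeout is passed in milliseconds so the example runs quickly (the proposal uses 30 secs). Every demand message, not just the first, is guarded by `BlockingQueue.poll(timeout, unit)`. On expiry, a warning modeled on the suggested log message is written and the same demand is re-sent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.logging.Logger;

/** Hypothetical demander-side model; none of these names are Ignite APIs. */
public class DemandTimeoutSketch {
    private static final Logger LOG = Logger.getLogger(DemandTimeoutSketch.class.getName());

    /** Supply messages delivered by the (simulated) supplier node. */
    private final BlockingQueue<String> supplies = new LinkedBlockingQueue<>();

    /** Number of timeouts (logged warnings) so far. */
    int timeouts;

    /** Called by the simulated transport when a supply message arrives. */
    void onSupply(String msg) {
        supplies.add(msg);
    }

    /**
     * Sends one demand message and waits for its supply message, re-sending the
     * same demand after every timeout. Returns null if all attempts time out.
     */
    String demandWithRetry(String node, String cache, int partId, int batch, long timeoutMs,
                           int maxAttempts, Consumer<String> transport) throws InterruptedException {
        String demand = "demand[cache=" + cache + ", partId=" + partId + ", batch=" + batch + "]";

        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            transport.accept(demand);

            String supply = supplies.poll(timeoutMs, TimeUnit.MILLISECONDS);
            if (supply != null)
                return supply;

            timeouts++;
            LOG.warning("Failed to wait for supply message from node within " + timeoutMs +
                " ms [node=" + node + ", cache=" + cache + ", partId=" + partId + "]");
        }

        return null;
    }

    /** Demands every batch of one partition; the timeout guards each demand, not only the first. */
    List<String> rebalancePartition(String node, String cache, int partId, int batches, long timeoutMs,
                                    int maxAttempts, Consumer<String> transport) throws InterruptedException {
        List<String> received = new ArrayList<>();

        for (int batch = 0; batch < batches; batch++) {
            String supply = demandWithRetry(node, cache, partId, batch, timeoutMs, maxAttempts, transport);

            if (supply == null)
                break; // Hypothetical policy: give up on this partition after maxAttempts timeouts.

            received.add(supply);
        }

        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        DemandTimeoutSketch demander = new DemandTimeoutSketch();
        int[] sent = {0};

        // Simulated supplier: silently drops the 2nd demand message and answers all others.
        Consumer<String> transport = demand -> {
            if (++sent[0] != 2)
                demander.onSupply("supply for " + demand);
        };

        List<String> got = demander.rebalancePartition("node-2", "C", 7, 3, 100, 3, transport);

        System.out.println(got.size() + " batches received, " + demander.timeouts + " timeout(s)");
    }
}
```

In `main`, the simulated supplier drops the second demand, so the partition still receives all three batches after one logged timeout and one re-sent demand. The sketch assumes supply messages are never duplicated. A real demander would also have to discard a late supply message that answers a demand it has already re-sent.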
You can use the following message:

Failed to wait for supply message from node within 30 secs [cache=C, partId=XX]

Alex Goncharuk, do you have any comments here?

Yakov Zhdanov
www.gridgain.com

2018-07-14 19:45 GMT+03:00 Maxim Muzafarov:

> Yakov,
>
> Yes, you're right. The whole rebalancing progress will be stopped.
>
> Actually, you're right that the rebalancing order doesn't matter here, too.
> The javadoc just describes how rebalancing should work for caches, but in
> fact it doesn't work as described. Personally, I'd prefer to start the
> rebalancing of each cache group asynchronously and independently.
>
> Please look at my reproducer [1].
>
> Scenario:
> Cluster with two REPLICATED caches.
> Start a new node.
> Rebalancing of the first cache group fails to start (e.g. due to network
> issues). That's OK.
> Rebalancing of the second cache group is never started, and all further
> progress gets stuck (I think rebalancing should be started here!).
>
> [1]
> https://github.com/Mmuzaf/ignite/blob/rebalance-cancel/modules/core/src/test/java/org/apache/ignite/internal/processors/cache/distributed/rebalancing/GridCacheRebalancingCancelSelfTest.java
>
> Fri, 13 Jul 2018 at 17:46, Yakov Zhdanov:
>
> > Maxim, I do not understand the problem. Imagine I do not have any ordering,
> > but rebalancing of some cache fails to start, so in my understanding the
> > overall rebalancing progress becomes blocked. Is that true?
> >
> > Can you please provide a reproducer for your problem?
> >
> > --Yakov
> >
> > 2018-07-09 16:42 GMT+03:00 Maxim Muzafarov:
> >
> > > Hello Igniters,
> > >
> > > Each cache group has a “rebalance order” property. As the javadoc for
> > > getRebalanceOrder() says: “Note that cache with order {@code 0} does not
> > > participate in ordering. This means that cache with rebalance order
> > > {@code 0} will never wait for any other caches. All caches with order
> > > {@code 0} will be rebalanced right away concurrently with each other and
> > > ordered rebalance processes. If not set, cache order is 0, i.e.
> > > rebalancing is not ordered.”
> > >
> > > In fact, GridCachePartitionExchangeManager always builds the chain of
> > > rebalancing cache groups to start (even for cache order ZERO):
> > >
> > > ignite-sys-cache -> cacheR -> cacheR3 -> cacheR2 -> cacheR5 -> cacheR1.
> > >
> > > If one of these groups fails to start, the groups after it will never be
> > > run.
> > >
> > > *Question 1*: Should we fix the javadoc description or create a bug for
> > > fixing such rebalance behavior?
> > >
> > > [1]
> > > https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2630
>
> --
> Maxim Muzafarov
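To make the difference in the thread concrete, here is a small hypothetical model. It uses plain `CompletableFuture`s, not `GridCachePartitionExchangeManager` code, and contrasts the observed single start chain with the behavior the `getRebalanceOrder()` javadoc describes. `RebalanceOrderSketch`, `start`, `chained` and `asDocumented` are illustrative names. The javadoc does not say what should happen when an ordered (non-zero) group fails to start, so keeping those groups in a chain is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical model of starting rebalancing for cache groups; not Ignite code. */
public class RebalanceOrderSketch {
    /** Asynchronously "starts" rebalancing of one group; groups in {@code failing} fail to start. */
    static CompletableFuture<Void> start(String group, Set<String> failing, Set<String> started) {
        return CompletableFuture.runAsync(() -> {
            started.add(group);
            if (failing.contains(group))
                throw new IllegalStateException("Failed to start rebalancing: " + group);
        });
    }

    /** Observed behavior: every group, even with order 0, waits for the previous one to start. */
    static Set<String> chained(Map<String, Integer> orders, Set<String> failing) {
        Set<String> started = ConcurrentHashMap.newKeySet();
        CompletableFuture<Void> chain = CompletableFuture.completedFuture(null);

        for (String grp : orders.keySet())
            chain = chain.thenCompose(ignored -> start(grp, failing, started)); // skipped once a start fails

        chain.exceptionally(err -> null).join(); // wait until the chain completes or breaks
        return started;
    }

    /** Documented behavior: order-0 groups start right away; the others start in ascending order. */
    static Set<String> asDocumented(Map<String, Integer> orders, Set<String> failing) {
        Set<String> started = ConcurrentHashMap.newKeySet();
        List<CompletableFuture<Void>> futs = new ArrayList<>();
        List<String> ordered = new ArrayList<>();

        for (Map.Entry<String, Integer> e : orders.entrySet()) {
            if (e.getValue() == 0)
                futs.add(start(e.getKey(), failing, started)); // never waits for other groups
            else
                ordered.add(e.getKey());
        }

        ordered.sort((a, b) -> Integer.compare(orders.get(a), orders.get(b)));

        CompletableFuture<Void> chain = CompletableFuture.completedFuture(null);
        for (String grp : ordered)
            chain = chain.thenCompose(ignored -> start(grp, failing, started)); // assumption: ordered groups stay chained
        futs.add(chain);

        for (CompletableFuture<Void> f : futs)
            f.exceptionally(err -> null).join(); // a failed start does not block the other futures
        return started;
    }

    public static void main(String[] args) {
        // Assumption: all groups keep the default order 0 ("If not set, cache order is 0").
        Map<String, Integer> orders = new LinkedHashMap<>();
        for (String grp : new String[] {"ignite-sys-cache", "cacheR", "cacheR3", "cacheR2", "cacheR5", "cacheR1"})
            orders.put(grp, 0);

        Set<String> failing = Collections.singleton("cacheR");

        System.out.println("chained:       " + chained(orders, failing).size() + " of " + orders.size() + " started");
        System.out.println("as documented: " + asDocumented(orders, failing).size() + " of " + orders.size() + " started");
    }
}
```

With all six groups at the default order 0 and `cacheR` failing, `chained` starts only `ignite-sys-cache` and `cacheR`, which mirrors the stuck scenario from the reproducer. `asDocumented` starts all six, which is what "will never wait for any other caches" implies.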
