Maxim,

1) There is a typo in the javadoc, feel free to fix it.
2) It's a bad idea to rebalance more than one cache simultaneously:
- It's hard to determine the cause of an error in that case when (not "if", but "when" :) ) we hit an issue in production (the 100+ caches case).
- We should keep the rebalance load limited. Rebalancing should not cause thousands of messages per second; this will lead to cluster death. rebalanceThreadPoolSize(), rebalanceBatchSize() and rebalanceBatchesPrefetchCount() give us a guarantee of limited but adequate load.

3) The correct fix for the situation you described is to restart the (chained) rebalancing for both caches on timeout. And that's what we'll get once the cluster detects that the node has IO issues and starts a new topology without it.

So it seems only javadoc fixes are required.

Wed, Jul 18, 2018 at 15:13, Yakov Zhdanov <[email protected]>:

> Maxim, I checked and it seems that send retry count is used only in cache
> IO manager and the usage is semantically very far from what I suggest.
> Resend count limits the attempts count, while I meant a successful send but
> possible problems on the supplier side.
>
> --Yakov
>
> 2018-07-17 19:01 GMT+03:00 Maxim Muzafarov <[email protected]>:
>
> > Yakov,
> >
> > But we already have DFLT_SEND_RETRY_CNT and DFLT_SEND_RETRY_DELAY for
> > configuring our CommunicationSPI behavior. What if a user configures these
> > parameters his own way and sees a lot of WARN messages in the log which
> > make no sense?
> >
> > Maybe we should use GridCachePartitionExchangeManager#forceRebalance (or
> > maybe forceReassign) if rebalancing fails after all those retries. What do
> > you think?
> >
> > Mon, Jul 16, 2018 at 21:12, Yakov Zhdanov <[email protected]>:
> >
> > > Maxim, I looked at the code you provided. I think we need to add some
> > > timeout validation and output a warning to the logs on the demander side
> > > in case there is no supply message within 30 secs, and repeat the
> > > demanding process. This should apply to any demand message throughout
> > > the rebalancing process, not only the first one.
> > >
> > > You can use the following message:
> > >
> > > Failed to wait for supply message from node within 30 secs [cache=C,
> > > partId=XX]
> > >
> > > Alex Goncharuk, do you have comments here?
> > >
> > > Yakov Zhdanov
> > > www.gridgain.com
> > >
> > > 2018-07-14 19:45 GMT+03:00 Maxim Muzafarov <[email protected]>:
> > >
> > > > Yakov,
> > > >
> > > > Yes, you're right. The whole rebalancing progress will be stopped.
> > > >
> > > > Actually, rebalancing order doesn't matter, you are right about that
> > > > too. The javadoc just describes the idea of how rebalancing should
> > > > work for caches, but in fact it doesn't work as described. Personally,
> > > > I'd prefer to start rebalancing of each cache group asynchronously and
> > > > independently.
> > > >
> > > > Please look at my reproducer [1].
> > > >
> > > > Scenario:
> > > > - Cluster with two REPLICATED caches.
> > > > - Start a new node.
> > > > - The first cache group's rebalance fails to start (e.g. network
> > > >   issues) - that's OK.
> > > > - The second cache group's rebalance will never be started - all
> > > >   further progress is stuck (I think rebalancing should be started
> > > >   here!).
> > > >
> > > > [1]
> > > > https://github.com/Mmuzaf/ignite/blob/rebalance-cancel/modules/core/src/test/java/org/apache/ignite/internal/processors/cache/distributed/rebalancing/GridCacheRebalancingCancelSelfTest.java
> > > >
> > > > Fri, Jul 13, 2018 at 17:46, Yakov Zhdanov <[email protected]>:
> > > >
> > > > > Maxim, I do not understand the problem. Imagine I do not have any
> > > > > ordering, but rebalancing of some cache fails to start - so in my
> > > > > understanding the overall rebalancing progress becomes blocked. Is
> > > > > that true?
> > > > >
> > > > > Can you please provide a reproducer for your problem?
> > > > >
> > > > > --Yakov
> > > > >
> > > > > 2018-07-09 16:42 GMT+03:00 Maxim Muzafarov <[email protected]>:
> > > > >
> > > > > > Hello Igniters,
> > > > > >
> > > > > > Each cache group has a "rebalance order" property. As the javadoc
> > > > > > for getRebalanceOrder() says: "Note that cache with order {@code 0}
> > > > > > does not participate in ordering. This means that cache with
> > > > > > rebalance order {@code 0} will never wait for any other caches. All
> > > > > > caches with order {@code 0} will be rebalanced right away
> > > > > > concurrently with each other and ordered rebalance processes. If
> > > > > > not set, cache order is 0, i.e. rebalancing is not ordered."
> > > > > >
> > > > > > In fact, GridCachePartitionExchangeManager always builds a chain of
> > > > > > cache groups to rebalance (even for cache order ZERO):
> > > > > >
> > > > > > ignite-sys-cache -> cacheR -> cacheR3 -> cacheR2 -> cacheR5 -> cacheR1.
> > > > > >
> > > > > > If one of these groups fails to start, further groups will never
> > > > > > be run.
> > > > > >
> > > > > > *Question 1*: Should we fix the javadoc description or create a
> > > > > > bug for fixing such rebalance behavior?
> > > > > >
> > > > > > [1]
> > > > > > https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2630
> > > >
> > > > --
> > > > --
> > > > Maxim Muzafarov
> >
> > --
> > --
> > Maxim Muzafarov
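P.S. For anyone following the getRebalanceOrder() javadoc discussion quoted above, here is a minimal configuration sketch of the property in question. The cache names and order value are made up for illustration; the setters are Ignite's public CacheConfiguration/IgniteConfiguration API:

```java
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class RebalanceOrderSketch {
    public static IgniteConfiguration config() {
        // Order 0 (the default): per the javadoc, such a cache does not
        // participate in ordering and should rebalance right away.
        CacheConfiguration<Integer, String> cacheA =
            new CacheConfiguration<>("cacheA");

        // Order > 0: per the javadoc, should wait for all caches with a
        // smaller rebalance order before starting its own rebalancing.
        CacheConfiguration<Integer, String> cacheB =
            new CacheConfiguration<>("cacheB");
        cacheB.setRebalanceOrder(2);

        return new IgniteConfiguration()
            .setCacheConfiguration(cacheA, cacheB);
    }
}
```

The point of the thread is that, in practice, the exchange manager chains even order-0 groups, so a failure in one group blocks the rest regardless of this setting.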

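P.P.S. As a footnote to point 2 above, a sketch of where the load-limiting knobs live. The values and the cache name "myCache" are illustrative assumptions, not recommendations; the setters are from Ignite's public configuration API:

```java
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class RebalanceLoadSketch {
    public static IgniteConfiguration config() {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Node-wide number of threads processing rebalance messages.
        cfg.setRebalanceThreadPoolSize(2);

        CacheConfiguration<Integer, byte[]> cache =
            new CacheConfiguration<>("myCache");

        // Approximate size of a single supply message, in bytes.
        cache.setRebalanceBatchSize(512 * 1024);

        // How many supply batches may be generated ahead of the
        // demander's acknowledgements.
        cache.setRebalanceBatchesPrefetchCount(2);

        return cfg.setCacheConfiguration(cache);
    }
}
```

Together these bound the message rate a single rebalancing process can put on the cluster, which is the "limited but proper load" argument above.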