Hi.

The proposed approach to improving rebalancing efficiency looks somewhat
naive to me.

A mature solution to this problem should include computing a "cost" for
each rebalancing type.

The cost formula should take various aspects into consideration, like the
number of pages in the memory cache, page replacement, disk read speed, the
number of indexes, etc.

As a short-term solution, the proposed heuristic looks acceptable.

My suggestions for the implementation:

1. Implement the cost framework right away.
The total rebalance cost is measured as the sum of the costs of all
rebalancing groups, and this sum should be minimized.
Rebalancing behavior should depend only on the outcome of the cost formula,
without the need to modify other code (a sketch of such a framework follows
after this list).

2. Avoid unnecessary WAL scanning.
Ideally we should scan only the segments that contain the required updates.
We can build a tracking data structure for each partition, adjust it on each
checkpoint and on WAL archive removal, and then use it for efficient scanning
during WAL iteration (see the second sketch below).

3. Avoid any calculations during the PME sync phase. The coordinator should
only collect the available WAL history, switch lagging OWNING partitions to
MOVING, and send all of this to the nodes.
After PME, the nodes should calculate the cost and choose the most efficient
rebalancing for their local data.

4. We should log the reason for choosing a specific rebalancing type,
together with the computed cost.
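
To make point 1 more concrete, here is a very rough sketch of how the cost
framework could look. All names below (RebalanceType, RebalanceCostModel,
RebalancePlanner) are made up for illustration and do not refer to existing
Ignite classes; the real formula would take the factors listed above (page
cache, disk speed, indexes, etc.) as inputs:

    /** Hypothetical rebalancing types considered by the planner. */
    enum RebalanceType { HISTORICAL, FULL }

    /** Hypothetical pluggable cost model; the formula lives behind this. */
    interface RebalanceCostModel {
        /** Estimated cost of rebalancing one group with the given type. */
        double cost(int grpId, RebalanceType type);
    }

    class RebalancePlanner {
        private final RebalanceCostModel model;

        RebalancePlanner(RebalanceCostModel model) {
            this.model = model;
        }

        /**
         * Chooses the cheaper rebalancing type for a group and logs the
         * reason (point 4). The overall plan minimizes the sum of per-group
         * costs.
         */
        RebalanceType chooseRebalanceType(int grpId) {
            double hist = model.cost(grpId, RebalanceType.HISTORICAL);
            double full = model.cost(grpId, RebalanceType.FULL);

            RebalanceType chosen = hist <= full
                ? RebalanceType.HISTORICAL : RebalanceType.FULL;

            System.out.printf(
                "Chosen %s rebalance for group %d [histCost=%.1f, fullCost=%.1f]%n",
                chosen, grpId, hist, full);

            return chosen;
        }
    }

Adding a new factor to the decision then means changing only the cost model,
not the rebalancing code itself.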
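
And a minimal sketch of the per-partition tracking structure from point 2,
again with made-up names and ignoring concurrency and checkpoint adjustment
details:

    import java.util.BitSet;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /**
     * For every partition remembers which WAL segments contain its updates,
     * so that historical rebalance can skip segments without relevant data.
     * Illustration only, not existing Ignite code.
     */
    class PartitionWalTracker {
        /** partId -> set of WAL segment indexes that contain its updates. */
        private final Map<Integer, BitSet> segmentsByPart = new ConcurrentHashMap<>();

        /** Called when an update for the partition is written to the segment. */
        void onUpdate(int partId, int segmentIdx) {
            segmentsByPart.computeIfAbsent(partId, p -> new BitSet()).set(segmentIdx);
        }

        /** Called when WAL archive segments up to lastRemovedIdx are removed. */
        void onArchiveRemoved(int lastRemovedIdx) {
            segmentsByPart.values().forEach(set -> set.clear(0, lastRemovedIdx + 1));
        }

        /** Returns true if the segment may contain updates for the partition. */
        boolean shouldScan(int partId, int segmentIdx) {
            BitSet set = segmentsByPart.get(partId);
            return set != null && set.get(segmentIdx);
        }
    }

During historical rebalance the iterator would then call shouldScan() and
skip segments that contain no updates for the requested partitions.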

I'm planning to implement tombstones in the near future. This should remove
the requirement to clear a partition before full rebalancing in most cases,
improving its cost.

This can be further enhanced later using something like Merkle trees,
file-based rebalancing, or another technique.

Thu, Jul 16, 2020 at 18:33, Ivan Rakov <[email protected]>:

> >
> >  I think we can modify the heuristic so that we
> > 1) Exclude partitions by a threshold (IGNITE_PDS_WAL_REBALANCE_THRESHOLD -
> > reduce it to 500)
> > 2) Select for historical rebalance only those partitions where the
> > difference between counters is less than the partition size.
>
> Agreed, let's go this way.
>
> On Thu, Jul 16, 2020 at 11:03 AM Vladislav Pyatkov <[email protected]>
> wrote:
>
> > I completely forgot about another argument in favor of using historical
> > rebalance where it is possible. When the cluster decides to use a full
> > rebalance, demander nodes have to clear non-empty partitions.
> > This can take a long time, which in some cases may be comparable with the
> > time of the rebalance itself.
> > This also supports the heuristic above.
> >
> > On Thu, Jul 16, 2020 at 12:09 AM Vladislav Pyatkov <[email protected]>
> > wrote:
> >
> > > Ivan,
> > >
> > > I agree with a combined approach: a threshold for small partitions and a
> > > count of updates for partitions that outgrew it.
> > > This helps to exclude partitions that are updated infrequently.
> > >
> > > Reading a big piece of WAL (more than 100 GB) can happen when a client
> > > has configured it that way intentionally.
> > > There is no doubt we can read it; otherwise the WAL space would not have
> > > been configured to be that large.
> > >
> > > I don't see a connection between the iterator optimization and the issue
> > > in the atomic protocol.
> > > Reordering in the WAL that happens within a checkpoint where the counter
> > > did not change is an extremely rare case, and the issue will not be
> > > solved for the generic case here; it should be fixed within the protocol
> > > itself.
> > >
> > > I think we can modify the heuristic so that we
> > > 1) Exclude partitions by a threshold (IGNITE_PDS_WAL_REBALANCE_THRESHOLD -
> > > reduce it to 500)
> > > 2) Select for historical rebalance only those partitions where the
> > > difference between counters is less than the partition size.
> > >
> > > Also implement the mentioned optimization for the historical iterator,
> > > which may reduce the time spent reading a large WAL interval.
> > >
> > > On Wed, Jul 15, 2020 at 3:15 PM Ivan Rakov <[email protected]>
> > wrote:
> > >
> > >> Hi Vladislav,
> > >>
> > >> Thanks for raising this topic.
> > >> The currently present IGNITE_PDS_WAL_REBALANCE_THRESHOLD (default is
> > >> 500_000) is controversial. Assuming that the default number of
> > >> partitions is 1024, a cache should contain a really huge amount of data
> > >> in order to make WAL delta rebalancing possible. In fact, it's currently
> > >> disabled for most production cases, which makes rebalancing of
> > >> persistent caches unreasonably long.
> > >>
> > >> I think your approach [1] makes much more sense than the current
> > >> heuristic; let's move forward with the proposed solution.
> > >>
> > >> Though, there are some other corner cases, e.g. this one:
> > >> - Configured size of WAL archive is big (>100 GB)
> > >> - Cache has small partitions (e.g. 1000 entries)
> > >> - Infrequent updates (e.g. ~100 in the whole WAL history of any node)
> > >> - There is another cache with very frequent updates which allocates >99%
> > >> of the WAL
> > >> In such a scenario we may need to iterate over >100 GB of WAL in order to
> > >> fetch <1% of the needed updates. Even though the amount of network
> > >> traffic is still optimized, it would be more effective to transfer
> > >> partitions with ~1000 entries fully instead of reading >100 GB of WAL.
> > >>
> > >> I want to highlight that your heuristic definitely makes the situation
> > >> better, but due to possible corner cases we should keep a fallback lever
> > >> to restrict or limit historical rebalance as before. Probably, it would
> > >> be handy to keep the IGNITE_PDS_WAL_REBALANCE_THRESHOLD property with a
> > >> low default value (1000, 500 or even 0) and apply your heuristic only
> > >> for partitions of a bigger size.
> > >>
> > >> Regarding case [2]: it looks like an improvement that can mitigate some
> > >> corner cases (including the one that I have described). I'm ok with it
> > >> as long as it takes reordering of data updates on backup nodes into
> > >> account. We don't track skipped updates for atomic caches. As a result,
> > >> detection of the absence of updates between two checkpoint markers with
> > >> the same partition counter can be a false positive.
> > >>
> > >> --
> > >> Best Regards,
> > >> Ivan Rakov
> > >>
> > >> On Tue, Jul 14, 2020 at 3:03 PM Vladislav Pyatkov <[email protected]>
> > >> wrote:
> > >>
> > >> > Hi guys,
> > >> >
> > >> > I want to implement a more honest heuristic for historical rebalance.
> > >> > Currently, a cluster chooses whether to use historical rebalance based
> > >> > only on the partition size. This threshold is better known by the name
> > >> > of the property IGNITE_PDS_WAL_REBALANCE_THRESHOLD.
> > >> > It can prevent a historical rebalance when a partition is too small,
> > >> > but not when the WAL contains more updates than the size of the
> > >> > partition; in that case historical rebalance can still be chosen.
> > >> > There is a ticket asking to implement a fairer heuristic [1].
> > >> >
> > >> > My idea for the implementation is to estimate the size of the data
> > >> > that will be transferred over the network. In other words, if
> > >> > rebalancing requires a part of the WAL that contains N updates in
> > >> > order to recover a partition on another node that contains M rows in
> > >> > total, historical rebalance should be chosen only when N < M (and the
> > >> > WAL history must be present as well).
> > >> >
> > >> > This approach is easy to implement, because the coordinator node has
> > >> > the partition sizes and the counter intervals. But in this case the
> > >> > cluster can still end up finding only a few updates in a very long WAL
> > >> > history. I assume it is possible to work around this if the historical
> > >> > rebalance iterator does not handle checkpoints that contain no updates
> > >> > for the particular cache. A checkpoint can be skipped if the counters
> > >> > for the cache (maybe even for specific partitions) did not change
> > >> > between it and the next one.
> > >> >
> > >> > Ticket for the historical rebalance iterator improvement [2]
> > >> >
> > >> > I want to hear the community's view on the thoughts above.
> > >> > Maybe someone has another opinion?
> > >> >
> > >> > [1]: https://issues.apache.org/jira/browse/IGNITE-13253
> > >> > [2]: https://issues.apache.org/jira/browse/IGNITE-13254
> > >> >
> > >> > --
> > >> > Vladislav Pyatkov
> > >> >
> > >>
> > >
> > >
> > > --
> > > Vladislav Pyatkov
> > >
> >
> >
> > --
> > Vladislav Pyatkov
> >
>


-- 

Best regards,
Alexei Scherbakov
