I completely forgot about another argument in favor of using historical
rebalance where it is possible. When the cluster decides to use full
rebalance, demander nodes have to clear their non-empty partitions first.
This can take a long time; in some cases it is comparable to the time of
the rebalance itself.
This also argues in favor of the heuristic above.

On Thu, Jul 16, 2020 at 12:09 AM Vladislav Pyatkov <[email protected]>
wrote:

> Ivan,
>
> I agree with a combined approach: a threshold for small partitions and an
> update count for partitions that outgrow it.
> This helps to avoid historical rebalance for partitions that are updated
> infrequently.
>
> Reading a big WAL piece (more than 100 GB) can only happen when a client
> has configured it that large intentionally.
> There is no doubt that we can read it; otherwise, the WAL space would not
> have been configured to be that large.
>
> I don't see a connection between the iterator optimization and the issue in
> the atomic protocol.
> Reordering in the WAL that happens within a checkpoint where the counter was
> not changing is an extremely rare case, and the issue will not be solved for
> the generic case here; it should be fixed within the bounds of the protocol.
>
> I think we can modify the heuristic as follows:
> 1) Exclude partitions by a threshold (IGNITE_PDS_WAL_REBALANCE_THRESHOLD -
> reduce it to 500).
> 2) Select for historical rebalance only those partitions where the
> difference between counters is less than the partition size.
>
> Also, implement the mentioned optimization for the historical iterator,
> which may reduce the time spent reading a large WAL interval.
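>
> The combined decision could roughly look like the sketch below (all names
> here are illustrative and not actual Ignite internals):
>
>     final class RebalanceHeuristicSketch {
>         /**
>          * Decides whether a partition should be rebalanced from WAL history.
>          *
>          * @param partSize Number of rows currently in the partition.
>          * @param demanderCntr Partition update counter on the demander node.
>          * @param supplierCntr Partition update counter on the supplier node.
>          * @param threshold Minimal partition size for historical rebalance,
>          *                  e.g. IGNITE_PDS_WAL_REBALANCE_THRESHOLD = 500.
>          */
>         static boolean useHistoricalRebalance(long partSize, long demanderCntr,
>             long supplierCntr, long threshold) {
>             // 1) Small partitions are always rebalanced fully.
>             if (partSize <= threshold)
>                 return false;
>
>             // 2) Use WAL history only if it contains fewer updates than the
>             //    partition has rows (N < M), i.e. the delta is cheaper than
>             //    transferring the full partition.
>             long updatesInWal = supplierCntr - demanderCntr;
>
>             return updatesInWal < partSize;
>         }
>     }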
>
> On Wed, Jul 15, 2020 at 3:15 PM Ivan Rakov <[email protected]> wrote:
>
>> Hi Vladislav,
>>
>> Thanks for raising this topic.
>> The current IGNITE_PDS_WAL_REBALANCE_THRESHOLD (default is 500_000) is
>> controversial. Assuming the default number of partitions is 1024, a cache
>> would need to hold roughly half a billion entries (1024 * 500_000) before
>> WAL delta rebalancing becomes possible. In fact, it's currently disabled
>> for most production cases, which makes rebalancing of persistent caches
>> unreasonably long.
>>
>> I think your approach [1] makes much more sense than the current
>> heuristic; let's move forward with the proposed solution.
>>
>> Though, there are some other corner cases, e.g. this one:
>> - Configured size of WAL archive is big (>100 GB)
>> - Cache has small partitions (e.g. 1000 entries)
>> - Infrequent updates (e.g. ~100 in the whole WAL history of any node)
>> - There is another cache with very frequent updates which allocates >99% of
>> the WAL
>> In such a scenario we may need to iterate over >100 GB of WAL in order to
>> fetch <1% of the needed updates. Even though the amount of network traffic is
>> still optimized, it would be more effective to transfer partitions with
>> ~1000 entries fully instead of reading >100 GB of WAL.
>>
>> I want to highlight that your heuristic definitely makes the situation
>> better, but due to possible corner cases we should keep a fallback lever
>> to restrict or limit historical rebalance as before. Probably, it would be
>> handy to keep the IGNITE_PDS_WAL_REBALANCE_THRESHOLD property with a low
>> default value (1000, 500 or even 0) and apply your heuristic only to
>> partitions above that size.
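>>
>> For instance, the fallback could still be tuned per node with the usual JVM
>> system property (the value here is only an example):
>>
>>     -DIGNITE_PDS_WAL_REBALANCE_THRESHOLD=500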
>>
>> Regarding case [2]: it looks like an improvement that can mitigate some
>> corner cases (including the one that I have described). I'm ok with it as
>> long as it takes data updates reordering on backup nodes into account. We
>> don't track skipped updates for atomic caches. As a result, detecting the
>> absence of updates between two checkpoint markers with the same partition
>> counter can produce false positives.
>>
>> --
>> Best Regards,
>> Ivan Rakov
>>
>> On Tue, Jul 14, 2020 at 3:03 PM Vladislav Pyatkov <[email protected]>
>> wrote:
>>
>> > Hi guys,
>> >
>> > I want to implement a more honest heuristic for historical rebalance.
>> > Currently, a cluster chooses between historical and full rebalance based
>> > only on the partition size. This threshold is better known by the name of
>> > the property IGNITE_PDS_WAL_REBALANCE_THRESHOLD.
>> > It can prevent historical rebalance when a partition is too small, but if
>> > the WAL contains more updates than the partition size, historical
>> > rebalance can still be chosen.
>> > There is a ticket for implementing a fairer heuristic [1].
>> >
>> > My idea for the implementation is to estimate the amount of data that
>> > will be transferred over the network. In other words, if rebalancing a
>> > part of the WAL that contains N updates would recover a partition on
>> > another node that contains M rows in total, historical rebalance should
>> > be chosen when N < M (and the WAL history must be present as well).
>> >
>> > This approach is easy to implement, because the coordinator node already
>> > has the partition sizes and the counter intervals. But in this case the
>> > cluster can still end up scanning a very long WAL history for only a few
>> > updates. I assume it is possible to work around this if the historical
>> > rebalance iterator does not handle checkpoints that contain no updates of
>> > the particular cache. A checkpoint can be skipped if the counters for the
>> > cache (maybe even for specific partitions) did not change between it and
>> > the next one.
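>> >
>> > A minimal sketch of this skip check, with purely hypothetical types (the
>> > real WAL and checkpoint structures in Ignite look different):
>> >
>> >     import java.util.Map;
>> >
>> >     /** Hypothetical view of a checkpoint marker: update counter per partition of the cache group. */
>> >     record CheckpointMarker(long walPointer, Map<Integer, Long> partCounters) {}
>> >
>> >     class HistoricalIteratorSketch {
>> >         /**
>> >          * The WAL interval between two consecutive checkpoint markers can
>> >          * be skipped if no partition counter of the rebalanced cache group
>> >          * changed in between.
>> >          *
>> >          * Note: for atomic caches equal counters may be a false positive,
>> >          * since skipped (reordered) updates are not tracked there.
>> >          */
>> >         static boolean canSkipInterval(CheckpointMarker from, CheckpointMarker to) {
>> >             return from.partCounters().equals(to.partCounters());
>> >         }
>> >     }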
>> >
>> > The ticket for improving the historical rebalance iterator is [2].
>> >
>> > I want to hear the community's view on the thoughts above.
>> > Maybe someone has another opinion?
>> >
>> > [1]: https://issues.apache.org/jira/browse/IGNITE-13253
>> > [2]: https://issues.apache.org/jira/browse/IGNITE-13254
>> >
>> > --
>> > Vladislav Pyatkov
>> >
>>
>
>
> --
> Vladislav Pyatkov
>


-- 
Vladislav Pyatkov
