Re: Extended logging for rebalance performance analysis

ткаленко кирилл Wed, 20 May 2020 12:27:10 -0700

Hi, Maxim! Unfortunately, I delayed my answer.

The production environment may not be ideal and problems may occur in it. I 
only suggest adding a little logging to understand some of the cases that may 
occur during operation. In addition, this functionality can be disabled by the 
system property. Rename variables to improve code readability.


We can do it via metrics, but whether need to store it for all suppliers and 
how long is unclear. In my case, I do a simple aggregation and clear it after 
output.


06.05.2020, 21:01, "Maxim Muzafarov" <[email protected]>:
> Kirill,
>
> Thank you for raising this topic. It's true that the rebalance process
> still requires additional information for analyzing issues. Please,
> don't think that I'm against your changes :-)
>
> * My short answer. *
>
> We won't do performance analysis on the production environment. Each
> time we need performance analysis it will be done on a test
> environment with verbose logging enabled. Thus I suggest moving these
> changes to a separate `profiling` module and extend the logging much
> more without any ышяу limitations. The same as these [2] [3]
> activities do.
>
> Let's keep the `core` module as simple as possible.
> Let's design the right API for accessing rebalance internals for
> profiling tools.
> Can you, please, remove all changes from your PR [6] which are not
> related to the proposed topic? (e.g. variable renamings).
>
> * The long answer. *
>
> Here are my thoughts on this. There are two different types of issues
> in the rebalance process. The first case must be covered by daily
> monitoring subsystems, the second case must be covered by additional
> profiling tools.
> 1. errors during the rebalancing (e.g. rebalance not happens when required)
> 2. performance rebalancing issues (e.g. rebalance is slow)
>
> Daily monitoring tools (JMX, Logging) are always turned on and
> shouldn't require additional systems resources themselves. Since these
> metrics must be lightweight any internal aggregation machinery is not
> used on them. All these metrics are collected from each node
> independently. Please, take a look at this issue [1] which covers most
> of your questions mentioned above.
>
> For all available metrics, we can configure LogExporterSpi, so they
> will be available in logs.
>
>>  1)How long was the balance(divided by supplier)?
>
> rebalancingStartTime, rebalancingEndTime already exists for cache groups [4].
> We can add the same for suppliers.
>
>>  2)How many records and bytes per supplier were rebalanced?
>
> We already have rebalancingReceivedKeys, rebalancingReceivedBytes [4]
> We will have rebalancingExpectedKeys [5].
> We can add a new metric per cache keys, per supplier.
>
>>  3)How many times did the rebalance restart?
>
> rebalancingLastCancelledTime [4] metric exists.
> Do we need to keep historical data on it?
>
>>  4)Which partitions were rebalanced and from which nodes did they receive 
>> them?
>
> Let's print this information to log prior to the rebalance process
> starts. This will be helpful information and do not require a lot of
> changes.
>
>>  5)When did rebalance for all cache groups end?
>
> This metric may be aggregated from rebalancingEndTime [4] by pulling
> it from all nodes for all caches.
>
> Performance rebalancing issues are related to profiling tools. They
> may require additional system resources and definitely require a
> dedicated environment for tests. We can't do performance analysis on
> production environments due to performance impact.
> I see some disadvantages of adding such tools to production code:
> - verbose logging may affect performance.
> - the problematic process may become even worse if an automatic
> threshold suddenly turns on.
> - new code changes will require additional efforts to keep logging up-to-date.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-12183
> [2] https://issues.apache.org/jira/browse/IGNITE-12666
> [3] 
> https://cwiki.apache.org/confluence/display/IGNITE/Cluster+performance+profiling+tool
> [4] https://issues.apache.org/jira/browse/IGNITE-12193
> [5] https://issues.apache.org/jira/browse/IGNITE-12194
> [6] https://github.com/apache/ignite/pull/7705/files
>
> On Wed, 6 May 2020 at 19:50, Ivan Rakov <[email protected]> wrote:
>>  Hi,
>>
>>  IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
>>  > duration rebalance of cache group after which partitions distribution is
>>  > output, set in milliseconds, default value is 10 minutes.
>>
>>   Does it mean that if the rebalancing process took less than 10 minutes,
>>  only a short version of the message (with supplier statistics) will show up?
>>
>>  In general, I have no objections.
>>
>>  On Mon, May 4, 2020 at 10:38 AM ткаленко кирилл <[email protected]>
>>  wrote:
>>
>>  > Hi, Igniters!
>>  >
>>  > I'd like to share a new small feature in AI [1].
>>  >
>>  > Current rebalance logging does not allow you to quickly answer following
>>  > questions:
>>  > 1)How long was the balance(divided by supplier)?
>>  > 2)How many records and bytes per supplier were rebalanced?
>>  > 3)How many times did rebalance restart?
>>  > 4)Which partitions were rebalanced and from which nodes did they receive
>>  > them?
>>  > 5)When did rebalance for all cache groups end?
>>  >
>>  > What you can see in logs now:
>>  >
>>  > 1)Starting rebalance with order of cache groups.
>>  > Rebalancing scheduled [order=[ignite-sys-cache, grp1, grp0],
>>  > top=AffinityTopologyVersion [topVer=2, minorTopVer=0], force=false,
>>  > evt=NODE_JOINED, node=c2146a04-dc23-4bc9-870d-dfbb55c00001]
>>  >
>>  > 2)Start rebalance of cache group from a specific supplier, specifying
>>  > partition ids and mode - historical or full.
>>  > Starting rebalance routine [ignite-sys-cache,
>>  > topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0],
>>  > supplier=8c525892-703b-4fc4-b28b-b2f139700000, fullPartitions=[0-99],
>>  > histPartitions=[]]
>>  >
>>  > 3)Getting partial or complete partitions of cache group.
>>  > Completed rebalancing [grp=ignite-sys-cache,
>>  > supplier=8c525892-703b-4fc4-b28b-b2f139700000,
>>  > topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=1/2]
>>  > Completed (final) rebalancing [grp=ignite-sys-cache,
>>  > supplier=c2146a04-dc23-4bc9-870d-dfbb55c00001,
>>  > topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=2/2]
>>  >
>>  > 4)End rebalance of cache group.
>>  > Completed rebalance future: RebalanceFuture [grp=CacheGroupContext
>>  > [grp=ignite-sys-cache], topVer=AffinityTopologyVersion [topVer=2,
>>  > minorTopVer=0], rebalanceId=1, routines=1, receivedBytes=1200,
>>  > receivedKeys=0, partitionsLeft=0, startTime=1588519707607, endTime=-1,
>>  > lastCancelledTime=-1]
>>  >
>>  > Rebalance statistics:
>>  >
>>  > To speed up rebalance analysis, statistics will be output for each cache
>>  > group and total for all cache groups.
>>  > If duration rebalance for cache group is greater than threshold value,
>>  > partition distribution is output.
>>  > Statistics will you to analyze duration of the balance for each supplier
>>  > to understand which of them has been transmitting data for longest time.
>>  >
>>  > System properties are used to output statistics:
>>  >
>>  > IGNITE_QUIET - to output statistics, value must be false;
>>  > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
>>  > duration rebalance of cache group after which partitions distribution is
>>  > output, set in milliseconds, default value is 10 minutes.
>>  >
>>  > Statistics examples:
>>  >
>>  > Successful full and historical rebalance of group cache, without
>>  > partitions distribution.
>>  > Rebalance information per cache group (successful rebalance): [id=3181548,
>>  > name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13
>>  > 10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0, p=5,
>>  > d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e - entries, b 
>> -
>>  > bytes, d - duration, h - historical, nodeId mapping
>>  > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
>>  > [1=rebalancing.RebalanceStatisticsTest0]
>>  > Rebalance information per cache group (successful rebalance): [id=3181547,
>>  > name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13
>>  > 15:01:44,116, d=116 ms, restarted=0] Supplier statistics: [nodeId=0, 
>> hp=10,
>>  > he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - entries, b -
>>  > bytes, d - duration, h - historical, nodeId mapping
>>  > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
>>  >
>>  > Successful full and historical rebalance of group cache, with partitions
>>  > distribution.
>>  > Rebalance information per cache group (successful rebalance): [id=3181548,
>>  > name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13
>>  > 10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0, p=5,
>>  > d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e - entries, b 
>> -
>>  > bytes, d - duration, h - historical, nodeId mapping
>>  > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
>>  > [1=rebalancing.RebalanceStatisticsTest0] Rebalance duration was greater
>>  > than 5 ms, printing detailed information about partitions distribution
>>  > (threshold can be changed by setting number of milliseconds into
>>  > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD) 0 =
>>  > [0,bu,su],[1,bu],[2,pr,su] 1 = [0,bu,su],[1,bu],[2,pr,su] 2 =
>>  > [0,bu,su],[1,bu],[2,pr,su] 3 = [0,bu,su],[1,bu],[2,pr,su] 4 =
>>  > [0,bu,su],[1,bu],[2,pr,su] 5 = [0,bu,su],[1,bu],[2,pr,su] 6 =
>>  > [0,bu,su],[1,bu],[2,pr,su] 7 = [0,bu,su],[1,bu],[2,pr,su] 8 =
>>  > [0,bu,su],[1,bu],[2,pr,su] 9 = [0,bu,su],[1,bu],[2,pr,su] Aliases: pr -
>>  > primary, bu - backup, su - supplier node, h - historical, nodeId mapping
>>  > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
>>  > [1=rebalancing.RebalanceStatisticsTest2]
>>  > [2=rebalancing.RebalanceStatisticsTest0]
>>  > Rebalance information per cache group (successful rebalance): [id=3181547,
>>  > name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13
>>  > 15:01:44,116, d=116 ms, restarted=0] Supplier statistics: [nodeId=0, 
>> hp=10,
>>  > he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - entries, b -
>>  > bytes, d - duration, h - historical, nodeId mapping
>>  > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0] 
>> Rebalance
>>  > duration was greater than 100 ms, printing detailed information about
>>  > partitions distribution (threshold can be changed by setting number of
>>  > milliseconds into IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD)
>>  > 0 = [0,pr,su],[1,bu],[2,bu] h 1 = [0,bu,su],[1,bu],[2,pr] h 2 =
>>  > [0,pr,su],[1,bu],[2,bu] h 3 = [0,bu,su],[1,bu],[2,pr] h 4 =
>>  > [0,pr,su],[1,bu],[2,bu] h 5 = [0,pr,su],[1,bu],[2,bu] h 6 =
>>  > [0,bu,su],[1,bu],[2,pr] h 7 = [0,pr,su],[1,bu],[2,bu] h 8 =
>>  > [0,pr,su],[1,bu],[2,bu] h 9 = [0,pr,su],[1,bu],[2,bu] h Aliases: pr -
>>  > primary, bu - backup, su - supplier node, h - historical, nodeId mapping
>>  > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
>>  > [1=rebalancing.RebalanceStatisticsTest2]
>>  > [2=rebalancing.RebalanceStatisticsTest1]
>>  >
>>  > Interrupted rebalance of group cache.
>>  > Rebalance information per cache group (interrupted rebalance):
>>  > [id=644280849, name=default2, startTime=2020-04-13 14:55:24,969,
>>  > finishTime=2020-04-13 14:55:24,969, d=0 ms, restarted=0]
>>  >
>>  > Total full and historical rebalance for all cache groups.
>>  > Rebalance total information (including successful and not rebalances):
>>  > [startTime=2020-04-13 10:55:18,726, finishTime=2020-04-13 10:55:18,780,
>>  > d=54 ms] Supplier statistics: [nodeId=0, p=60, e=250, b=25000, d=54 ms]
>>  > [nodeId=1, p=60, e=250, b=24945, d=54 ms] Aliases: p - partitions, e -
>>  > entries, b - bytes, d - duration, h - historical, nodeId mapping
>>  > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
>>  > [1=rebalancing.RebalanceStatisticsTest0]
>>  > Rebalance total information (including successful and not rebalances):
>>  > [startTime=2020-04-13 15:01:43,822, finishTime=2020-04-13 15:01:44,116,
>>  > d=294 ms] Supplier statistics: [nodeId=0, hp=20, he=500, hb=50445, d=294
>>  > ms] Aliases: p - partitions, e - entries, b - bytes, d - duration, h -
>>  > historical, nodeId mapping (nodeId=id,consistentId)
>>  > [0=rebalancing.RebalanceStatisticsTest0]
>>  >
>>  > [1] - https://issues.apache.org/jira/browse/IGNITE-12080

Re: Extended logging for rebalance performance analysis

Reply via email to