Re: Extended logging for rebalance performance analysis

Alexei Scherbakov Tue, 23 Jun 2020 08:21:23 -0700

Hi, Kirill.

Looks good to me.


вт, 23 июн. 2020 г. в 18:05, ткаленко кирилл <[email protected]>:

> Hello, Alexey!
>
> I suggest that we decide what we can do within ticket [1].
>
> Add "rebalanceId" and "topVer" related to rebalance to all messages.
>
> Add statistical information to a log message:
> [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing
> [rebalanceId=1, grp=cache1,
> supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, partitions=5, entries=100,
> duration=12ms,
> bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> progress=1/2]
>
> Add a message to log after rebalancing for all cache groups is complete:
> [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
> [rebalanceId=2, partitions=10, entries=200, duration=50ms, bytesRcvd=10M]
>
> Any comments or suggestions?
>
> [1] - https://issues.apache.org/jira/browse/IGNITE-12080
>
> 20.05.2020, 23:08, "ткаленко кирилл" <[email protected]>:
> > Hello, Alexey! Unfortunately, my response was delayed.
> >
> > Point 2: You can do as you suggested, I think it is still worth
> specifying how many partitions were obtained.
> >
> > [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
> > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, partitions=5,
> entries=100, duration=12ms,
> > bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> > progress=1/2]
> >
> > Point 3: is It "rebalanceId"?
> >
> > Point 5: I think we can output a summary for each supplier, so as not to
> keep it in mind.
> >
> > [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
> > [rebalanceId=2, [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001,
> partitions=5, entries=100, duration=12ms, bytesRcvd=5M],
> > [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00002, partitions=5,
> entries=100, duration=12ms, bytesRcvd=5M]]
> >
> > I can add "rebalanceId" to each message that you gave at above.
> >
> > A detailed message will help us understand how correctly the suppliers
> were selected.
> >
> > 06.05.2020, 22:08, "Alexei Scherbakov" <[email protected]>:
> >>  Hello.
> >>
> >>  Let's look at existing rebalancing log for a single group:
> >>
> >>  [2020-05-06 20:56:36,999][INFO ][...] Rebalancing scheduled
> >>  [order=[ignite-sys-cache, cache1, cache2, default],
> >>  top=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> >>  evt=DISCOVERY_CUSTOM_EVT, node=9d9edb7b-eb01-47a1-8ff9-fef715d00002]
> >>  ...
> >>  [2020-05-06 20:56:37,034][INFO ][...] Prepared rebalancing [grp=cache1,
> >>  mode=ASYNC, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001,
> >>  partitionsCount=11, topVer=AffinityTopologyVersion [topVer=3,
> >>  minorTopVer=1]]
> >>  [2020-05-06 20:56:37,036][INFO ][...] Prepared rebalancing [grp=cache1,
> >>  mode=ASYNC, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000,
> >>  partitionsCount=10, topVer=AffinityTopologyVersion [topVer=3,
> >>  minorTopVer=1]]
> >>  [2020-05-06 20:56:37,036][INFO ][...] Starting rebalance routine
> [cache1,
> >>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> >>  supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, fullPartitions=[1, 5,
> 7, 9,
> >>  11, 13, 15, 23, 27, 29, 31], histPartitions=[]]
> >>  [2020-05-06 20:56:37,037][INFO ][...] Starting rebalance routine
> [cache1,
> >>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> >>  supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000, fullPartitions=[6, 8,
> 10,
> >>  16, 18, 20, 22, 24, 26, 28], histPartitions=[]]
> >>  [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing
> [grp=cache1,
> >>  supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001,
> >>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=1/2]
> >>  [2020-05-06 20:56:37,046][INFO ][...] Completed (final) rebalancing
> >>  [grp=cache1, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000,
> >>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=2/2]
> >>  [2020-05-06 20:56:37,048][INFO ][...] Completed rebalance future:
> >>  RebalanceFuture [grp=CacheGroupContext [grp=cache1],
> >>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> rebalanceId=2,
> >>  routines=2]
> >>
> >>  From these logs I'm already can get answers to 1 and 4.
> >>  The logs look concise and easy to read and understand, and should
> >>  remain what way.
> >>
> >>  But I think some proposed improvements can be done here without harm.
> >>
> >>  2. OK, let's add it to supplier info per cache with additional info:
> >>
> >>  [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing
> [grp=cache1,
> >>  supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, entries=100,
> duration=12ms,
> >>  bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> >>  progress=1/2]
> >>
> >>  3. This information is already printed to log.
> >>
> >>  5. OK, let's add a line which is printed after all groups in the chain
> are
> >>  rebalanced or cancelled with a summary:
> >>
> >>  [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
> >>  [rebalanceId=2, entries=200, duration=50ms, bytesRcvd=10M]
> >>
> >>  Another thing I would suggest: add rebalanceId to all messages for
> ease of
> >>  grepping and to distinguish multiple rebalances for a same topology
> version.
> >>
> >>  Regarding the "detailed" message - I wasn't able to figure out any case
> >>  where it is useful. Not sure if we need it at all.
> >>  Kirill Tkalenko, can you give me an example?
> >>
> >>  ср, 6 мая 2020 г. в 21:01, Maxim Muzafarov <[email protected]>:
> >>
> >>>   Kirill,
> >>>
> >>>   Thank you for raising this topic. It's true that the rebalance
> process
> >>>   still requires additional information for analyzing issues. Please,
> >>>   don't think that I'm against your changes :-)
> >>>
> >>>   * My short answer. *
> >>>
> >>>   We won't do performance analysis on the production environment. Each
> >>>   time we need performance analysis it will be done on a test
> >>>   environment with verbose logging enabled. Thus I suggest moving these
> >>>   changes to a separate `profiling` module and extend the logging much
> >>>   more without any ышяу limitations. The same as these [2] [3]
> >>>   activities do.
> >>>
> >>>   Let's keep the `core` module as simple as possible.
> >>>   Let's design the right API for accessing rebalance internals for
> >>>   profiling tools.
> >>>   Can you, please, remove all changes from your PR [6] which are not
> >>>   related to the proposed topic? (e.g. variable renamings).
> >>>
> >>>   * The long answer. *
> >>>
> >>>   Here are my thoughts on this. There are two different types of issues
> >>>   in the rebalance process. The first case must be covered by daily
> >>>   monitoring subsystems, the second case must be covered by additional
> >>>   profiling tools.
> >>>   1. errors during the rebalancing (e.g. rebalance not happens when
> required)
> >>>   2. performance rebalancing issues (e.g. rebalance is slow)
> >>>
> >>>   Daily monitoring tools (JMX, Logging) are always turned on and
> >>>   shouldn't require additional systems resources themselves. Since
> these
> >>>   metrics must be lightweight any internal aggregation machinery is not
> >>>   used on them. All these metrics are collected from each node
> >>>   independently. Please, take a look at this issue [1] which covers
> most
> >>>   of your questions mentioned above.
> >>>
> >>>   For all available metrics, we can configure LogExporterSpi, so they
> >>>   will be available in logs.
> >>>
> >>>   > 1)How long was the balance(divided by supplier)?
> >>>   rebalancingStartTime, rebalancingEndTime already exists for cache
> groups
> >>>   [4].
> >>>   We can add the same for suppliers.
> >>>
> >>>   > 2)How many records and bytes per supplier were rebalanced?
> >>>   We already have rebalancingReceivedKeys, rebalancingReceivedBytes [4]
> >>>   We will have rebalancingExpectedKeys [5].
> >>>   We can add a new metric per cache keys, per supplier.
> >>>
> >>>   > 3)How many times did the rebalance restart?
> >>>   rebalancingLastCancelledTime [4] metric exists.
> >>>   Do we need to keep historical data on it?
> >>>
> >>>   > 4)Which partitions were rebalanced and from which nodes did they
> receive
> >>>   them?
> >>>   Let's print this information to log prior to the rebalance process
> >>>   starts. This will be helpful information and do not require a lot of
> >>>   changes.
> >>>
> >>>   > 5)When did rebalance for all cache groups end?
> >>>   This metric may be aggregated from rebalancingEndTime [4] by pulling
> >>>   it from all nodes for all caches.
> >>>
> >>>   Performance rebalancing issues are related to profiling tools. They
> >>>   may require additional system resources and definitely require a
> >>>   dedicated environment for tests. We can't do performance analysis on
> >>>   production environments due to performance impact.
> >>>   I see some disadvantages of adding such tools to production code:
> >>>   - verbose logging may affect performance.
> >>>   - the problematic process may become even worse if an automatic
> >>>   threshold suddenly turns on.
> >>>   - new code changes will require additional efforts to keep logging
> >>>   up-to-date.
> >>>
> >>>   [1] https://issues.apache.org/jira/browse/IGNITE-12183
> >>>   [2] https://issues.apache.org/jira/browse/IGNITE-12666
> >>>   [3]
> >>>
> https://cwiki.apache.org/confluence/display/IGNITE/Cluster+performance+profiling+tool
> >>>   [4] https://issues.apache.org/jira/browse/IGNITE-12193
> >>>   [5] https://issues.apache.org/jira/browse/IGNITE-12194
> >>>   [6] https://github.com/apache/ignite/pull/7705/files
> >>>
> >>>   On Wed, 6 May 2020 at 19:50, Ivan Rakov <[email protected]>
> wrote:
> >>>   >
> >>>   > Hi,
> >>>   >
> >>>   > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
> >>>   > > duration rebalance of cache group after which partitions
> distribution
> >>>   is
> >>>   > > output, set in milliseconds, default value is 10 minutes.
> >>>   >
> >>>   > Does it mean that if the rebalancing process took less than 10
> minutes,
> >>>   > only a short version of the message (with supplier statistics)
> will show
> >>>   up?
> >>>   >
> >>>   > In general, I have no objections.
> >>>   >
> >>>   >
> >>>   > On Mon, May 4, 2020 at 10:38 AM ткаленко кирилл <
> [email protected]>
> >>>   > wrote:
> >>>   >
> >>>   > > Hi, Igniters!
> >>>   > >
> >>>   > > I'd like to share a new small feature in AI [1].
> >>>   > >
> >>>   > > Current rebalance logging does not allow you to quickly answer
> >>>   following
> >>>   > > questions:
> >>>   > > 1)How long was the balance(divided by supplier)?
> >>>   > > 2)How many records and bytes per supplier were rebalanced?
> >>>   > > 3)How many times did rebalance restart?
> >>>   > > 4)Which partitions were rebalanced and from which nodes did they
> >>>   receive
> >>>   > > them?
> >>>   > > 5)When did rebalance for all cache groups end?
> >>>   > >
> >>>   > > What you can see in logs now:
> >>>   > >
> >>>   > > 1)Starting rebalance with order of cache groups.
> >>>   > > Rebalancing scheduled [order=[ignite-sys-cache, grp1, grp0],
> >>>   > > top=AffinityTopologyVersion [topVer=2, minorTopVer=0],
> force=false,
> >>>   > > evt=NODE_JOINED, node=c2146a04-dc23-4bc9-870d-dfbb55c00001]
> >>>   > >
> >>>   > > 2)Start rebalance of cache group from a specific supplier,
> specifying
> >>>   > > partition ids and mode - historical or full.
> >>>   > > Starting rebalance routine [ignite-sys-cache,
> >>>   > > topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0],
> >>>   > > supplier=8c525892-703b-4fc4-b28b-b2f139700000,
> fullPartitions=[0-99],
> >>>   > > histPartitions=[]]
> >>>   > >
> >>>   > > 3)Getting partial or complete partitions of cache group.
> >>>   > > Completed rebalancing [grp=ignite-sys-cache,
> >>>   > > supplier=8c525892-703b-4fc4-b28b-b2f139700000,
> >>>   > > topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0],
> progress=1/2]
> >>>   > > Completed (final) rebalancing [grp=ignite-sys-cache,
> >>>   > > supplier=c2146a04-dc23-4bc9-870d-dfbb55c00001,
> >>>   > > topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0],
> progress=2/2]
> >>>   > >
> >>>   > > 4)End rebalance of cache group.
> >>>   > > Completed rebalance future: RebalanceFuture
> [grp=CacheGroupContext
> >>>   > > [grp=ignite-sys-cache], topVer=AffinityTopologyVersion [topVer=2,
> >>>   > > minorTopVer=0], rebalanceId=1, routines=1, receivedBytes=1200,
> >>>   > > receivedKeys=0, partitionsLeft=0, startTime=1588519707607,
> endTime=-1,
> >>>   > > lastCancelledTime=-1]
> >>>   > >
> >>>   > > Rebalance statistics:
> >>>   > >
> >>>   > > To speed up rebalance analysis, statistics will be output for
> each
> >>>   cache
> >>>   > > group and total for all cache groups.
> >>>   > > If duration rebalance for cache group is greater than threshold
> value,
> >>>   > > partition distribution is output.
> >>>   > > Statistics will you to analyze duration of the balance for each
> >>>   supplier
> >>>   > > to understand which of them has been transmitting data for
> longest
> >>>   time.
> >>>   > >
> >>>   > > System properties are used to output statistics:
> >>>   > >
> >>>   > > IGNITE_QUIET - to output statistics, value must be false;
> >>>   > > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD -
> threshold
> >>>   > > duration rebalance of cache group after which partitions
> distribution
> >>>   is
> >>>   > > output, set in milliseconds, default value is 10 minutes.
> >>>   > >
> >>>   > > Statistics examples:
> >>>   > >
> >>>   > > Successful full and historical rebalance of group cache, without
> >>>   > > partitions distribution.
> >>>   > > Rebalance information per cache group (successful rebalance):
> >>>   [id=3181548,
> >>>   > > name=grp1, startTime=2020-04-13 10:55:16,117,
> finishTime=2020-04-13
> >>>   > > 10:55:16,127, d=10 ms, restarted=0] Supplier statistics:
> [nodeId=0,
> >>>   p=5,
> >>>   > > d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e -
> >>>   entries, b -
> >>>   > > bytes, d - duration, h - historical, nodeId mapping
> >>>   > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> >>>   > > [1=rebalancing.RebalanceStatisticsTest0]
> >>>   > > Rebalance information per cache group (successful rebalance):
> >>>   [id=3181547,
> >>>   > > name=grp0, startTime=2020-04-13 15:01:44,000,
> finishTime=2020-04-13
> >>>   > > 15:01:44,116, d=116 ms, restarted=0] Supplier statistics:
> [nodeId=0,
> >>>   hp=10,
> >>>   > > he=300, hb=30267, d=116 ms] Aliases: p - partitions, e -
> entries, b -
> >>>   > > bytes, d - duration, h - historical, nodeId mapping
> >>>   > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
> >>>   > >
> >>>   > > Successful full and historical rebalance of group cache, with
> >>>   partitions
> >>>   > > distribution.
> >>>   > > Rebalance information per cache group (successful rebalance):
> >>>   [id=3181548,
> >>>   > > name=grp1, startTime=2020-04-13 10:55:16,117,
> finishTime=2020-04-13
> >>>   > > 10:55:16,127, d=10 ms, restarted=0] Supplier statistics:
> [nodeId=0,
> >>>   p=5,
> >>>   > > d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e -
> >>>   entries, b -
> >>>   > > bytes, d - duration, h - historical, nodeId mapping
> >>>   > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> >>>   > > [1=rebalancing.RebalanceStatisticsTest0] Rebalance duration was
> greater
> >>>   > > than 5 ms, printing detailed information about partitions
> distribution
> >>>   > > (threshold can be changed by setting number of milliseconds into
> >>>   > > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD) 0 =
> >>>   > > [0,bu,su],[1,bu],[2,pr,su] 1 = [0,bu,su],[1,bu],[2,pr,su] 2 =
> >>>   > > [0,bu,su],[1,bu],[2,pr,su] 3 = [0,bu,su],[1,bu],[2,pr,su] 4 =
> >>>   > > [0,bu,su],[1,bu],[2,pr,su] 5 = [0,bu,su],[1,bu],[2,pr,su] 6 =
> >>>   > > [0,bu,su],[1,bu],[2,pr,su] 7 = [0,bu,su],[1,bu],[2,pr,su] 8 =
> >>>   > > [0,bu,su],[1,bu],[2,pr,su] 9 = [0,bu,su],[1,bu],[2,pr,su]
> Aliases: pr -
> >>>   > > primary, bu - backup, su - supplier node, h - historical, nodeId
> >>>   mapping
> >>>   > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> >>>   > > [1=rebalancing.RebalanceStatisticsTest2]
> >>>   > > [2=rebalancing.RebalanceStatisticsTest0]
> >>>   > > Rebalance information per cache group (successful rebalance):
> >>>   [id=3181547,
> >>>   > > name=grp0, startTime=2020-04-13 15:01:44,000,
> finishTime=2020-04-13
> >>>   > > 15:01:44,116, d=116 ms, restarted=0] Supplier statistics:
> [nodeId=0,
> >>>   hp=10,
> >>>   > > he=300, hb=30267, d=116 ms] Aliases: p - partitions, e -
> entries, b -
> >>>   > > bytes, d - duration, h - historical, nodeId mapping
> >>>   > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
> >>>   Rebalance
> >>>   > > duration was greater than 100 ms, printing detailed information
> about
> >>>   > > partitions distribution (threshold can be changed by setting
> number of
> >>>   > > milliseconds into
> >>>   IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD)
> >>>   > > 0 = [0,pr,su],[1,bu],[2,bu] h 1 = [0,bu,su],[1,bu],[2,pr] h 2 =
> >>>   > > [0,pr,su],[1,bu],[2,bu] h 3 = [0,bu,su],[1,bu],[2,pr] h 4 =
> >>>   > > [0,pr,su],[1,bu],[2,bu] h 5 = [0,pr,su],[1,bu],[2,bu] h 6 =
> >>>   > > [0,bu,su],[1,bu],[2,pr] h 7 = [0,pr,su],[1,bu],[2,bu] h 8 =
> >>>   > > [0,pr,su],[1,bu],[2,bu] h 9 = [0,pr,su],[1,bu],[2,bu] h Aliases:
> pr -
> >>>   > > primary, bu - backup, su - supplier node, h - historical, nodeId
> >>>   mapping
> >>>   > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
> >>>   > > [1=rebalancing.RebalanceStatisticsTest2]
> >>>   > > [2=rebalancing.RebalanceStatisticsTest1]
> >>>   > >
> >>>   > > Interrupted rebalance of group cache.
> >>>   > > Rebalance information per cache group (interrupted rebalance):
> >>>   > > [id=644280849, name=default2, startTime=2020-04-13 14:55:24,969,
> >>>   > > finishTime=2020-04-13 14:55:24,969, d=0 ms, restarted=0]
> >>>   > >
> >>>   > > Total full and historical rebalance for all cache groups.
> >>>   > > Rebalance total information (including successful and not
> rebalances):
> >>>   > > [startTime=2020-04-13 10:55:18,726, finishTime=2020-04-13
> 10:55:18,780,
> >>>   > > d=54 ms] Supplier statistics: [nodeId=0, p=60, e=250, b=25000,
> d=54 ms]
> >>>   > > [nodeId=1, p=60, e=250, b=24945, d=54 ms] Aliases: p -
> partitions, e -
> >>>   > > entries, b - bytes, d - duration, h - historical, nodeId mapping
> >>>   > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> >>>   > > [1=rebalancing.RebalanceStatisticsTest0]
> >>>   > > Rebalance total information (including successful and not
> rebalances):
> >>>   > > [startTime=2020-04-13 15:01:43,822, finishTime=2020-04-13
> 15:01:44,116,
> >>>   > > d=294 ms] Supplier statistics: [nodeId=0, hp=20, he=500,
> hb=50445,
> >>>   d=294
> >>>   > > ms] Aliases: p - partitions, e - entries, b - bytes, d -
> duration, h -
> >>>   > > historical, nodeId mapping (nodeId=id,consistentId)
> >>>   > > [0=rebalancing.RebalanceStatisticsTest0]
> >>>   > >
> >>>   > > [1] - https://issues.apache.org/jira/browse/IGNITE-12080
> >>
> >>  --
> >>
> >>  Best regards,
> >>  Alexei Scherbakov
>


-- 

Best regards,
Alexei Scherbakov

Re: Extended logging for rebalance performance analysis

Reply via email to