Re: Extended logging for rebalance performance analysis

Alexei Scherbakov Tue, 26 May 2020 05:06:02 -0700

Hi, Kirill.

2. Ok for me.


3. We already have log message about rebalance cancellation for specific
rebalanceId and log message about new rebalancing has started with next
rebalanceId.

5. We already have summary information per group for each supplier added in
p.2
I would avoid duplication here or put it under debug logging.

Information about selected suppliers can already be obtained from logs:
[2020-05-06 20:56:37,034][INFO ][...] Prepared rebalancing [grp=cache1,
mode=ASYNC, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001,
partitionsCount=11, topVer=AffinityTopologyVersion [topVer=3,
minorTopVer=1]]
[2020-05-06 20:56:37,036][INFO ][...] Prepared rebalancing [grp=cache1,
mode=ASYNC, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000,
partitionsCount=10, topVer=AffinityTopologyVersion [topVer=3,
minorTopVer=1]]

so I still do not see any value in "detailed" message.

I think this is enough to understand rebalancing flow using proper grepping
by topVer and rebalanceId.

All additional aggregation should be done by external tools using
rebalancing metrics.
I'm on the same page with Maxim Muzafarov here.



ср, 20 мая 2020 г. в 23:08, ткаленко кирилл <[email protected]>:

> Hello, Alexey! Unfortunately, my response was delayed.
>
> Point 2: You can do as you suggested, I think it is still worth specifying
> how many partitions were obtained.
>
> [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
> supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, partitions=5, entries=100,
> duration=12ms,
> bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> progress=1/2]
>
> Point 3: is It "rebalanceId"?
>
> Point 5: I think we can output a summary for each supplier, so as not to
> keep it in mind.
>
> [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
> [rebalanceId=2, [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001,
> partitions=5, entries=100, duration=12ms, bytesRcvd=5M],
> [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00002, partitions=5, entries=100,
> duration=12ms, bytesRcvd=5M]]
>
> I can add "rebalanceId" to each message that you gave at above.
>
> A detailed message will help us understand how correctly the suppliers
> were selected.
>
> 06.05.2020, 22:08, "Alexei Scherbakov" <[email protected]>:
> > Hello.
> >
> > Let's look at existing rebalancing log for a single group:
> >
> > [2020-05-06 20:56:36,999][INFO ][...] Rebalancing scheduled
> > [order=[ignite-sys-cache, cache1, cache2, default],
> > top=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> > evt=DISCOVERY_CUSTOM_EVT, node=9d9edb7b-eb01-47a1-8ff9-fef715d00002]
> > ...
> > [2020-05-06 20:56:37,034][INFO ][...] Prepared rebalancing [grp=cache1,
> > mode=ASYNC, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001,
> > partitionsCount=11, topVer=AffinityTopologyVersion [topVer=3,
> > minorTopVer=1]]
> > [2020-05-06 20:56:37,036][INFO ][...] Prepared rebalancing [grp=cache1,
> > mode=ASYNC, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000,
> > partitionsCount=10, topVer=AffinityTopologyVersion [topVer=3,
> > minorTopVer=1]]
> > [2020-05-06 20:56:37,036][INFO ][...] Starting rebalance routine [cache1,
> > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, fullPartitions=[1, 5, 7,
> 9,
> > 11, 13, 15, 23, 27, 29, 31], histPartitions=[]]
> > [2020-05-06 20:56:37,037][INFO ][...] Starting rebalance routine [cache1,
> > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> > supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000, fullPartitions=[6, 8, 10,
> > 16, 18, 20, 22, 24, 26, 28], histPartitions=[]]
> > [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
> > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001,
> > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=1/2]
> > [2020-05-06 20:56:37,046][INFO ][...] Completed (final) rebalancing
> > [grp=cache1, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000,
> > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=2/2]
> > [2020-05-06 20:56:37,048][INFO ][...] Completed rebalance future:
> > RebalanceFuture [grp=CacheGroupContext [grp=cache1],
> > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], rebalanceId=2,
> > routines=2]
> >
> > From these logs I'm already can get answers to 1 and 4.
> > The logs look concise and easy to read and understand, and should
> > remain what way.
> >
> > But I think some proposed improvements can be done here without harm.
> >
> > 2. OK, let's add it to supplier info per cache with additional info:
> >
> > [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
> > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, entries=100,
> duration=12ms,
> > bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> > progress=1/2]
> >
> > 3. This information is already printed to log.
> >
> > 5. OK, let's add a line which is printed after all groups in the chain
> are
> > rebalanced or cancelled with a summary:
> >
> > [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
> > [rebalanceId=2, entries=200, duration=50ms, bytesRcvd=10M]
> >
> > Another thing I would suggest: add rebalanceId to all messages for ease
> of
> > grepping and to distinguish multiple rebalances for a same topology
> version.
> >
> > Regarding the "detailed" message - I wasn't able to figure out any case
> > where it is useful. Not sure if we need it at all.
> > Kirill Tkalenko, can you give me an example?
> >
> > ср, 6 мая 2020 г. в 21:01, Maxim Muzafarov <[email protected]>:
> >
> >>  Kirill,
> >>
> >>  Thank you for raising this topic. It's true that the rebalance process
> >>  still requires additional information for analyzing issues. Please,
> >>  don't think that I'm against your changes :-)
> >>
> >>  * My short answer. *
> >>
> >>  We won't do performance analysis on the production environment. Each
> >>  time we need performance analysis it will be done on a test
> >>  environment with verbose logging enabled. Thus I suggest moving these
> >>  changes to a separate `profiling` module and extend the logging much
> >>  more without any ышяу limitations. The same as these [2] [3]
> >>  activities do.
> >>
> >>  Let's keep the `core` module as simple as possible.
> >>  Let's design the right API for accessing rebalance internals for
> >>  profiling tools.
> >>  Can you, please, remove all changes from your PR [6] which are not
> >>  related to the proposed topic? (e.g. variable renamings).
> >>
> >>  * The long answer. *
> >>
> >>  Here are my thoughts on this. There are two different types of issues
> >>  in the rebalance process. The first case must be covered by daily
> >>  monitoring subsystems, the second case must be covered by additional
> >>  profiling tools.
> >>  1. errors during the rebalancing (e.g. rebalance not happens when
> required)
> >>  2. performance rebalancing issues (e.g. rebalance is slow)
> >>
> >>  Daily monitoring tools (JMX, Logging) are always turned on and
> >>  shouldn't require additional systems resources themselves. Since these
> >>  metrics must be lightweight any internal aggregation machinery is not
> >>  used on them. All these metrics are collected from each node
> >>  independently. Please, take a look at this issue [1] which covers most
> >>  of your questions mentioned above.
> >>
> >>  For all available metrics, we can configure LogExporterSpi, so they
> >>  will be available in logs.
> >>
> >>  > 1)How long was the balance(divided by supplier)?
> >>  rebalancingStartTime, rebalancingEndTime already exists for cache
> groups
> >>  [4].
> >>  We can add the same for suppliers.
> >>
> >>  > 2)How many records and bytes per supplier were rebalanced?
> >>  We already have rebalancingReceivedKeys, rebalancingReceivedBytes [4]
> >>  We will have rebalancingExpectedKeys [5].
> >>  We can add a new metric per cache keys, per supplier.
> >>
> >>  > 3)How many times did the rebalance restart?
> >>  rebalancingLastCancelledTime [4] metric exists.
> >>  Do we need to keep historical data on it?
> >>
> >>  > 4)Which partitions were rebalanced and from which nodes did they
> receive
> >>  them?
> >>  Let's print this information to log prior to the rebalance process
> >>  starts. This will be helpful information and do not require a lot of
> >>  changes.
> >>
> >>  > 5)When did rebalance for all cache groups end?
> >>  This metric may be aggregated from rebalancingEndTime [4] by pulling
> >>  it from all nodes for all caches.
> >>
> >>  Performance rebalancing issues are related to profiling tools. They
> >>  may require additional system resources and definitely require a
> >>  dedicated environment for tests. We can't do performance analysis on
> >>  production environments due to performance impact.
> >>  I see some disadvantages of adding such tools to production code:
> >>  - verbose logging may affect performance.
> >>  - the problematic process may become even worse if an automatic
> >>  threshold suddenly turns on.
> >>  - new code changes will require additional efforts to keep logging
> >>  up-to-date.
> >>
> >>  [1] https://issues.apache.org/jira/browse/IGNITE-12183
> >>  [2] https://issues.apache.org/jira/browse/IGNITE-12666
> >>  [3]
> >>
> https://cwiki.apache.org/confluence/display/IGNITE/Cluster+performance+profiling+tool
> >>  [4] https://issues.apache.org/jira/browse/IGNITE-12193
> >>  [5] https://issues.apache.org/jira/browse/IGNITE-12194
> >>  [6] https://github.com/apache/ignite/pull/7705/files
> >>
> >>  On Wed, 6 May 2020 at 19:50, Ivan Rakov <[email protected]> wrote:
> >>  >
> >>  > Hi,
> >>  >
> >>  > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
> >>  > > duration rebalance of cache group after which partitions
> distribution
> >>  is
> >>  > > output, set in milliseconds, default value is 10 minutes.
> >>  >
> >>  > Does it mean that if the rebalancing process took less than 10
> minutes,
> >>  > only a short version of the message (with supplier statistics) will
> show
> >>  up?
> >>  >
> >>  > In general, I have no objections.
> >>  >
> >>  >
> >>  > On Mon, May 4, 2020 at 10:38 AM ткаленко кирилл <
> [email protected]>
> >>  > wrote:
> >>  >
> >>  > > Hi, Igniters!
> >>  > >
> >>  > > I'd like to share a new small feature in AI [1].
> >>  > >
> >>  > > Current rebalance logging does not allow you to quickly answer
> >>  following
> >>  > > questions:
> >>  > > 1)How long was the balance(divided by supplier)?
> >>  > > 2)How many records and bytes per supplier were rebalanced?
> >>  > > 3)How many times did rebalance restart?
> >>  > > 4)Which partitions were rebalanced and from which nodes did they
> >>  receive
> >>  > > them?
> >>  > > 5)When did rebalance for all cache groups end?
> >>  > >
> >>  > > What you can see in logs now:
> >>  > >
> >>  > > 1)Starting rebalance with order of cache groups.
> >>  > > Rebalancing scheduled [order=[ignite-sys-cache, grp1, grp0],
> >>  > > top=AffinityTopologyVersion [topVer=2, minorTopVer=0], force=false,
> >>  > > evt=NODE_JOINED, node=c2146a04-dc23-4bc9-870d-dfbb55c00001]
> >>  > >
> >>  > > 2)Start rebalance of cache group from a specific supplier,
> specifying
> >>  > > partition ids and mode - historical or full.
> >>  > > Starting rebalance routine [ignite-sys-cache,
> >>  > > topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0],
> >>  > > supplier=8c525892-703b-4fc4-b28b-b2f139700000,
> fullPartitions=[0-99],
> >>  > > histPartitions=[]]
> >>  > >
> >>  > > 3)Getting partial or complete partitions of cache group.
> >>  > > Completed rebalancing [grp=ignite-sys-cache,
> >>  > > supplier=8c525892-703b-4fc4-b28b-b2f139700000,
> >>  > > topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0],
> progress=1/2]
> >>  > > Completed (final) rebalancing [grp=ignite-sys-cache,
> >>  > > supplier=c2146a04-dc23-4bc9-870d-dfbb55c00001,
> >>  > > topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0],
> progress=2/2]
> >>  > >
> >>  > > 4)End rebalance of cache group.
> >>  > > Completed rebalance future: RebalanceFuture [grp=CacheGroupContext
> >>  > > [grp=ignite-sys-cache], topVer=AffinityTopologyVersion [topVer=2,
> >>  > > minorTopVer=0], rebalanceId=1, routines=1, receivedBytes=1200,
> >>  > > receivedKeys=0, partitionsLeft=0, startTime=1588519707607,
> endTime=-1,
> >>  > > lastCancelledTime=-1]
> >>  > >
> >>  > > Rebalance statistics:
> >>  > >
> >>  > > To speed up rebalance analysis, statistics will be output for each
> >>  cache
> >>  > > group and total for all cache groups.
> >>  > > If duration rebalance for cache group is greater than threshold
> value,
> >>  > > partition distribution is output.
> >>  > > Statistics will you to analyze duration of the balance for each
> >>  supplier
> >>  > > to understand which of them has been transmitting data for longest
> >>  time.
> >>  > >
> >>  > > System properties are used to output statistics:
> >>  > >
> >>  > > IGNITE_QUIET - to output statistics, value must be false;
> >>  > > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
> >>  > > duration rebalance of cache group after which partitions
> distribution
> >>  is
> >>  > > output, set in milliseconds, default value is 10 minutes.
> >>  > >
> >>  > > Statistics examples:
> >>  > >
> >>  > > Successful full and historical rebalance of group cache, without
> >>  > > partitions distribution.
> >>  > > Rebalance information per cache group (successful rebalance):
> >>  [id=3181548,
> >>  > > name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13
> >>  > > 10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0,
> >>  p=5,
> >>  > > d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e -
> >>  entries, b -
> >>  > > bytes, d - duration, h - historical, nodeId mapping
> >>  > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> >>  > > [1=rebalancing.RebalanceStatisticsTest0]
> >>  > > Rebalance information per cache group (successful rebalance):
> >>  [id=3181547,
> >>  > > name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13
> >>  > > 15:01:44,116, d=116 ms, restarted=0] Supplier statistics:
> [nodeId=0,
> >>  hp=10,
> >>  > > he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - entries,
> b -
> >>  > > bytes, d - duration, h - historical, nodeId mapping
> >>  > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
> >>  > >
> >>  > > Successful full and historical rebalance of group cache, with
> >>  partitions
> >>  > > distribution.
> >>  > > Rebalance information per cache group (successful rebalance):
> >>  [id=3181548,
> >>  > > name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13
> >>  > > 10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0,
> >>  p=5,
> >>  > > d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e -
> >>  entries, b -
> >>  > > bytes, d - duration, h - historical, nodeId mapping
> >>  > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> >>  > > [1=rebalancing.RebalanceStatisticsTest0] Rebalance duration was
> greater
> >>  > > than 5 ms, printing detailed information about partitions
> distribution
> >>  > > (threshold can be changed by setting number of milliseconds into
> >>  > > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD) 0 =
> >>  > > [0,bu,su],[1,bu],[2,pr,su] 1 = [0,bu,su],[1,bu],[2,pr,su] 2 =
> >>  > > [0,bu,su],[1,bu],[2,pr,su] 3 = [0,bu,su],[1,bu],[2,pr,su] 4 =
> >>  > > [0,bu,su],[1,bu],[2,pr,su] 5 = [0,bu,su],[1,bu],[2,pr,su] 6 =
> >>  > > [0,bu,su],[1,bu],[2,pr,su] 7 = [0,bu,su],[1,bu],[2,pr,su] 8 =
> >>  > > [0,bu,su],[1,bu],[2,pr,su] 9 = [0,bu,su],[1,bu],[2,pr,su] Aliases:
> pr -
> >>  > > primary, bu - backup, su - supplier node, h - historical, nodeId
> >>  mapping
> >>  > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> >>  > > [1=rebalancing.RebalanceStatisticsTest2]
> >>  > > [2=rebalancing.RebalanceStatisticsTest0]
> >>  > > Rebalance information per cache group (successful rebalance):
> >>  [id=3181547,
> >>  > > name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13
> >>  > > 15:01:44,116, d=116 ms, restarted=0] Supplier statistics:
> [nodeId=0,
> >>  hp=10,
> >>  > > he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - entries,
> b -
> >>  > > bytes, d - duration, h - historical, nodeId mapping
> >>  > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
> >>  Rebalance
> >>  > > duration was greater than 100 ms, printing detailed information
> about
> >>  > > partitions distribution (threshold can be changed by setting
> number of
> >>  > > milliseconds into
> >>  IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD)
> >>  > > 0 = [0,pr,su],[1,bu],[2,bu] h 1 = [0,bu,su],[1,bu],[2,pr] h 2 =
> >>  > > [0,pr,su],[1,bu],[2,bu] h 3 = [0,bu,su],[1,bu],[2,pr] h 4 =
> >>  > > [0,pr,su],[1,bu],[2,bu] h 5 = [0,pr,su],[1,bu],[2,bu] h 6 =
> >>  > > [0,bu,su],[1,bu],[2,pr] h 7 = [0,pr,su],[1,bu],[2,bu] h 8 =
> >>  > > [0,pr,su],[1,bu],[2,bu] h 9 = [0,pr,su],[1,bu],[2,bu] h Aliases:
> pr -
> >>  > > primary, bu - backup, su - supplier node, h - historical, nodeId
> >>  mapping
> >>  > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
> >>  > > [1=rebalancing.RebalanceStatisticsTest2]
> >>  > > [2=rebalancing.RebalanceStatisticsTest1]
> >>  > >
> >>  > > Interrupted rebalance of group cache.
> >>  > > Rebalance information per cache group (interrupted rebalance):
> >>  > > [id=644280849, name=default2, startTime=2020-04-13 14:55:24,969,
> >>  > > finishTime=2020-04-13 14:55:24,969, d=0 ms, restarted=0]
> >>  > >
> >>  > > Total full and historical rebalance for all cache groups.
> >>  > > Rebalance total information (including successful and not
> rebalances):
> >>  > > [startTime=2020-04-13 10:55:18,726, finishTime=2020-04-13
> 10:55:18,780,
> >>  > > d=54 ms] Supplier statistics: [nodeId=0, p=60, e=250, b=25000,
> d=54 ms]
> >>  > > [nodeId=1, p=60, e=250, b=24945, d=54 ms] Aliases: p - partitions,
> e -
> >>  > > entries, b - bytes, d - duration, h - historical, nodeId mapping
> >>  > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> >>  > > [1=rebalancing.RebalanceStatisticsTest0]
> >>  > > Rebalance total information (including successful and not
> rebalances):
> >>  > > [startTime=2020-04-13 15:01:43,822, finishTime=2020-04-13
> 15:01:44,116,
> >>  > > d=294 ms] Supplier statistics: [nodeId=0, hp=20, he=500, hb=50445,
> >>  d=294
> >>  > > ms] Aliases: p - partitions, e - entries, b - bytes, d - duration,
> h -
> >>  > > historical, nodeId mapping (nodeId=id,consistentId)
> >>  > > [0=rebalancing.RebalanceStatisticsTest0]
> >>  > >
> >>  > > [1] - https://issues.apache.org/jira/browse/IGNITE-12080
> >
> > --
> >
> > Best regards,
> > Alexei Scherbakov
>


-- 

Best regards,
Alexei Scherbakov

Re: Extended logging for rebalance performance analysis

Reply via email to