Hi, Kirill. Looks good to me.
вт, 23 июн. 2020 г. в 18:05, ткаленко кирилл <tkalkir...@yandex.ru>: > Hello, Alexey! > > I suggest that we decide what we can do within ticket [1]. > > Add "rebalanceId" and "topVer" related to rebalance to all messages. > > Add statistical information to a log message: > [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing > [rebalanceId=1, grp=cache1, > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, partitions=5, entries=100, > duration=12ms, > bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], > progress=1/2] > > Add a message to log after rebalancing for all cache groups is complete: > [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain: > [rebalanceId=2, partitions=10, entries=200, duration=50ms, bytesRcvd=10M] > > Any comments or suggestions? > > [1] - https://issues.apache.org/jira/browse/IGNITE-12080 > > 20.05.2020, 23:08, "ткаленко кирилл" <tkalkir...@yandex.ru>: > > Hello, Alexey! Unfortunately, my response was delayed. > > > > Point 2: You can do as you suggested, I think it is still worth > specifying how many partitions were obtained. > > > > [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1, > > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, partitions=5, > entries=100, duration=12ms, > > bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], > > progress=1/2] > > > > Point 3: is It "rebalanceId"? > > > > Point 5: I think we can output a summary for each supplier, so as not to > keep it in mind. > > > > [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain: > > [rebalanceId=2, [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, > partitions=5, entries=100, duration=12ms, bytesRcvd=5M], > > [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00002, partitions=5, > entries=100, duration=12ms, bytesRcvd=5M]] > > > > I can add "rebalanceId" to each message that you gave at above. > > > > A detailed message will help us understand how correctly the suppliers > were selected. > > > > 06.05.2020, 22:08, "Alexei Scherbakov" <alexey.scherbak...@gmail.com>: > >> Hello. > >> > >> Let's look at existing rebalancing log for a single group: > >> > >> [2020-05-06 20:56:36,999][INFO ][...] Rebalancing scheduled > >> [order=[ignite-sys-cache, cache1, cache2, default], > >> top=AffinityTopologyVersion [topVer=3, minorTopVer=1], > >> evt=DISCOVERY_CUSTOM_EVT, node=9d9edb7b-eb01-47a1-8ff9-fef715d00002] > >> ... > >> [2020-05-06 20:56:37,034][INFO ][...] Prepared rebalancing [grp=cache1, > >> mode=ASYNC, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, > >> partitionsCount=11, topVer=AffinityTopologyVersion [topVer=3, > >> minorTopVer=1]] > >> [2020-05-06 20:56:37,036][INFO ][...] Prepared rebalancing [grp=cache1, > >> mode=ASYNC, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000, > >> partitionsCount=10, topVer=AffinityTopologyVersion [topVer=3, > >> minorTopVer=1]] > >> [2020-05-06 20:56:37,036][INFO ][...] Starting rebalance routine > [cache1, > >> topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], > >> supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, fullPartitions=[1, 5, > 7, 9, > >> 11, 13, 15, 23, 27, 29, 31], histPartitions=[]] > >> [2020-05-06 20:56:37,037][INFO ][...] Starting rebalance routine > [cache1, > >> topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], > >> supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000, fullPartitions=[6, 8, > 10, > >> 16, 18, 20, 22, 24, 26, 28], histPartitions=[]] > >> [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing > [grp=cache1, > >> supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, > >> topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=1/2] > >> [2020-05-06 20:56:37,046][INFO ][...] Completed (final) rebalancing > >> [grp=cache1, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000, > >> topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=2/2] > >> [2020-05-06 20:56:37,048][INFO ][...] Completed rebalance future: > >> RebalanceFuture [grp=CacheGroupContext [grp=cache1], > >> topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], > rebalanceId=2, > >> routines=2] > >> > >> From these logs I'm already can get answers to 1 and 4. > >> The logs look concise and easy to read and understand, and should > >> remain what way. > >> > >> But I think some proposed improvements can be done here without harm. > >> > >> 2. OK, let's add it to supplier info per cache with additional info: > >> > >> [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing > [grp=cache1, > >> supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, entries=100, > duration=12ms, > >> bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], > >> progress=1/2] > >> > >> 3. This information is already printed to log. > >> > >> 5. OK, let's add a line which is printed after all groups in the chain > are > >> rebalanced or cancelled with a summary: > >> > >> [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain: > >> [rebalanceId=2, entries=200, duration=50ms, bytesRcvd=10M] > >> > >> Another thing I would suggest: add rebalanceId to all messages for > ease of > >> grepping and to distinguish multiple rebalances for a same topology > version. > >> > >> Regarding the "detailed" message - I wasn't able to figure out any case > >> where it is useful. Not sure if we need it at all. > >> Kirill Tkalenko, can you give me an example? > >> > >> ср, 6 мая 2020 г. в 21:01, Maxim Muzafarov <mmu...@apache.org>: > >> > >>> Kirill, > >>> > >>> Thank you for raising this topic. It's true that the rebalance > process > >>> still requires additional information for analyzing issues. Please, > >>> don't think that I'm against your changes :-) > >>> > >>> * My short answer. * > >>> > >>> We won't do performance analysis on the production environment. Each > >>> time we need performance analysis it will be done on a test > >>> environment with verbose logging enabled. Thus I suggest moving these > >>> changes to a separate `profiling` module and extend the logging much > >>> more without any ышяу limitations. The same as these [2] [3] > >>> activities do. > >>> > >>> Let's keep the `core` module as simple as possible. > >>> Let's design the right API for accessing rebalance internals for > >>> profiling tools. > >>> Can you, please, remove all changes from your PR [6] which are not > >>> related to the proposed topic? (e.g. variable renamings). > >>> > >>> * The long answer. * > >>> > >>> Here are my thoughts on this. There are two different types of issues > >>> in the rebalance process. The first case must be covered by daily > >>> monitoring subsystems, the second case must be covered by additional > >>> profiling tools. > >>> 1. errors during the rebalancing (e.g. rebalance not happens when > required) > >>> 2. performance rebalancing issues (e.g. rebalance is slow) > >>> > >>> Daily monitoring tools (JMX, Logging) are always turned on and > >>> shouldn't require additional systems resources themselves. Since > these > >>> metrics must be lightweight any internal aggregation machinery is not > >>> used on them. All these metrics are collected from each node > >>> independently. Please, take a look at this issue [1] which covers > most > >>> of your questions mentioned above. > >>> > >>> For all available metrics, we can configure LogExporterSpi, so they > >>> will be available in logs. > >>> > >>> > 1)How long was the balance(divided by supplier)? > >>> rebalancingStartTime, rebalancingEndTime already exists for cache > groups > >>> [4]. > >>> We can add the same for suppliers. > >>> > >>> > 2)How many records and bytes per supplier were rebalanced? > >>> We already have rebalancingReceivedKeys, rebalancingReceivedBytes [4] > >>> We will have rebalancingExpectedKeys [5]. > >>> We can add a new metric per cache keys, per supplier. > >>> > >>> > 3)How many times did the rebalance restart? > >>> rebalancingLastCancelledTime [4] metric exists. > >>> Do we need to keep historical data on it? > >>> > >>> > 4)Which partitions were rebalanced and from which nodes did they > receive > >>> them? > >>> Let's print this information to log prior to the rebalance process > >>> starts. This will be helpful information and do not require a lot of > >>> changes. > >>> > >>> > 5)When did rebalance for all cache groups end? > >>> This metric may be aggregated from rebalancingEndTime [4] by pulling > >>> it from all nodes for all caches. > >>> > >>> Performance rebalancing issues are related to profiling tools. They > >>> may require additional system resources and definitely require a > >>> dedicated environment for tests. We can't do performance analysis on > >>> production environments due to performance impact. > >>> I see some disadvantages of adding such tools to production code: > >>> - verbose logging may affect performance. > >>> - the problematic process may become even worse if an automatic > >>> threshold suddenly turns on. > >>> - new code changes will require additional efforts to keep logging > >>> up-to-date. > >>> > >>> [1] https://issues.apache.org/jira/browse/IGNITE-12183 > >>> [2] https://issues.apache.org/jira/browse/IGNITE-12666 > >>> [3] > >>> > https://cwiki.apache.org/confluence/display/IGNITE/Cluster+performance+profiling+tool > >>> [4] https://issues.apache.org/jira/browse/IGNITE-12193 > >>> [5] https://issues.apache.org/jira/browse/IGNITE-12194 > >>> [6] https://github.com/apache/ignite/pull/7705/files > >>> > >>> On Wed, 6 May 2020 at 19:50, Ivan Rakov <ivan.glu...@gmail.com> > wrote: > >>> > > >>> > Hi, > >>> > > >>> > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold > >>> > > duration rebalance of cache group after which partitions > distribution > >>> is > >>> > > output, set in milliseconds, default value is 10 minutes. > >>> > > >>> > Does it mean that if the rebalancing process took less than 10 > minutes, > >>> > only a short version of the message (with supplier statistics) > will show > >>> up? > >>> > > >>> > In general, I have no objections. > >>> > > >>> > > >>> > On Mon, May 4, 2020 at 10:38 AM ткаленко кирилл < > tkalkir...@yandex.ru> > >>> > wrote: > >>> > > >>> > > Hi, Igniters! > >>> > > > >>> > > I'd like to share a new small feature in AI [1]. > >>> > > > >>> > > Current rebalance logging does not allow you to quickly answer > >>> following > >>> > > questions: > >>> > > 1)How long was the balance(divided by supplier)? > >>> > > 2)How many records and bytes per supplier were rebalanced? > >>> > > 3)How many times did rebalance restart? > >>> > > 4)Which partitions were rebalanced and from which nodes did they > >>> receive > >>> > > them? > >>> > > 5)When did rebalance for all cache groups end? > >>> > > > >>> > > What you can see in logs now: > >>> > > > >>> > > 1)Starting rebalance with order of cache groups. > >>> > > Rebalancing scheduled [order=[ignite-sys-cache, grp1, grp0], > >>> > > top=AffinityTopologyVersion [topVer=2, minorTopVer=0], > force=false, > >>> > > evt=NODE_JOINED, node=c2146a04-dc23-4bc9-870d-dfbb55c00001] > >>> > > > >>> > > 2)Start rebalance of cache group from a specific supplier, > specifying > >>> > > partition ids and mode - historical or full. > >>> > > Starting rebalance routine [ignite-sys-cache, > >>> > > topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], > >>> > > supplier=8c525892-703b-4fc4-b28b-b2f139700000, > fullPartitions=[0-99], > >>> > > histPartitions=[]] > >>> > > > >>> > > 3)Getting partial or complete partitions of cache group. > >>> > > Completed rebalancing [grp=ignite-sys-cache, > >>> > > supplier=8c525892-703b-4fc4-b28b-b2f139700000, > >>> > > topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], > progress=1/2] > >>> > > Completed (final) rebalancing [grp=ignite-sys-cache, > >>> > > supplier=c2146a04-dc23-4bc9-870d-dfbb55c00001, > >>> > > topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], > progress=2/2] > >>> > > > >>> > > 4)End rebalance of cache group. > >>> > > Completed rebalance future: RebalanceFuture > [grp=CacheGroupContext > >>> > > [grp=ignite-sys-cache], topVer=AffinityTopologyVersion [topVer=2, > >>> > > minorTopVer=0], rebalanceId=1, routines=1, receivedBytes=1200, > >>> > > receivedKeys=0, partitionsLeft=0, startTime=1588519707607, > endTime=-1, > >>> > > lastCancelledTime=-1] > >>> > > > >>> > > Rebalance statistics: > >>> > > > >>> > > To speed up rebalance analysis, statistics will be output for > each > >>> cache > >>> > > group and total for all cache groups. > >>> > > If duration rebalance for cache group is greater than threshold > value, > >>> > > partition distribution is output. > >>> > > Statistics will you to analyze duration of the balance for each > >>> supplier > >>> > > to understand which of them has been transmitting data for > longest > >>> time. > >>> > > > >>> > > System properties are used to output statistics: > >>> > > > >>> > > IGNITE_QUIET - to output statistics, value must be false; > >>> > > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - > threshold > >>> > > duration rebalance of cache group after which partitions > distribution > >>> is > >>> > > output, set in milliseconds, default value is 10 minutes. > >>> > > > >>> > > Statistics examples: > >>> > > > >>> > > Successful full and historical rebalance of group cache, without > >>> > > partitions distribution. > >>> > > Rebalance information per cache group (successful rebalance): > >>> [id=3181548, > >>> > > name=grp1, startTime=2020-04-13 10:55:16,117, > finishTime=2020-04-13 > >>> > > 10:55:16,127, d=10 ms, restarted=0] Supplier statistics: > [nodeId=0, > >>> p=5, > >>> > > d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e - > >>> entries, b - > >>> > > bytes, d - duration, h - historical, nodeId mapping > >>> > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1] > >>> > > [1=rebalancing.RebalanceStatisticsTest0] > >>> > > Rebalance information per cache group (successful rebalance): > >>> [id=3181547, > >>> > > name=grp0, startTime=2020-04-13 15:01:44,000, > finishTime=2020-04-13 > >>> > > 15:01:44,116, d=116 ms, restarted=0] Supplier statistics: > [nodeId=0, > >>> hp=10, > >>> > > he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - > entries, b - > >>> > > bytes, d - duration, h - historical, nodeId mapping > >>> > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0] > >>> > > > >>> > > Successful full and historical rebalance of group cache, with > >>> partitions > >>> > > distribution. > >>> > > Rebalance information per cache group (successful rebalance): > >>> [id=3181548, > >>> > > name=grp1, startTime=2020-04-13 10:55:16,117, > finishTime=2020-04-13 > >>> > > 10:55:16,127, d=10 ms, restarted=0] Supplier statistics: > [nodeId=0, > >>> p=5, > >>> > > d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e - > >>> entries, b - > >>> > > bytes, d - duration, h - historical, nodeId mapping > >>> > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1] > >>> > > [1=rebalancing.RebalanceStatisticsTest0] Rebalance duration was > greater > >>> > > than 5 ms, printing detailed information about partitions > distribution > >>> > > (threshold can be changed by setting number of milliseconds into > >>> > > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD) 0 = > >>> > > [0,bu,su],[1,bu],[2,pr,su] 1 = [0,bu,su],[1,bu],[2,pr,su] 2 = > >>> > > [0,bu,su],[1,bu],[2,pr,su] 3 = [0,bu,su],[1,bu],[2,pr,su] 4 = > >>> > > [0,bu,su],[1,bu],[2,pr,su] 5 = [0,bu,su],[1,bu],[2,pr,su] 6 = > >>> > > [0,bu,su],[1,bu],[2,pr,su] 7 = [0,bu,su],[1,bu],[2,pr,su] 8 = > >>> > > [0,bu,su],[1,bu],[2,pr,su] 9 = [0,bu,su],[1,bu],[2,pr,su] > Aliases: pr - > >>> > > primary, bu - backup, su - supplier node, h - historical, nodeId > >>> mapping > >>> > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1] > >>> > > [1=rebalancing.RebalanceStatisticsTest2] > >>> > > [2=rebalancing.RebalanceStatisticsTest0] > >>> > > Rebalance information per cache group (successful rebalance): > >>> [id=3181547, > >>> > > name=grp0, startTime=2020-04-13 15:01:44,000, > finishTime=2020-04-13 > >>> > > 15:01:44,116, d=116 ms, restarted=0] Supplier statistics: > [nodeId=0, > >>> hp=10, > >>> > > he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - > entries, b - > >>> > > bytes, d - duration, h - historical, nodeId mapping > >>> > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0] > >>> Rebalance > >>> > > duration was greater than 100 ms, printing detailed information > about > >>> > > partitions distribution (threshold can be changed by setting > number of > >>> > > milliseconds into > >>> IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD) > >>> > > 0 = [0,pr,su],[1,bu],[2,bu] h 1 = [0,bu,su],[1,bu],[2,pr] h 2 = > >>> > > [0,pr,su],[1,bu],[2,bu] h 3 = [0,bu,su],[1,bu],[2,pr] h 4 = > >>> > > [0,pr,su],[1,bu],[2,bu] h 5 = [0,pr,su],[1,bu],[2,bu] h 6 = > >>> > > [0,bu,su],[1,bu],[2,pr] h 7 = [0,pr,su],[1,bu],[2,bu] h 8 = > >>> > > [0,pr,su],[1,bu],[2,bu] h 9 = [0,pr,su],[1,bu],[2,bu] h Aliases: > pr - > >>> > > primary, bu - backup, su - supplier node, h - historical, nodeId > >>> mapping > >>> > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0] > >>> > > [1=rebalancing.RebalanceStatisticsTest2] > >>> > > [2=rebalancing.RebalanceStatisticsTest1] > >>> > > > >>> > > Interrupted rebalance of group cache. > >>> > > Rebalance information per cache group (interrupted rebalance): > >>> > > [id=644280849, name=default2, startTime=2020-04-13 14:55:24,969, > >>> > > finishTime=2020-04-13 14:55:24,969, d=0 ms, restarted=0] > >>> > > > >>> > > Total full and historical rebalance for all cache groups. > >>> > > Rebalance total information (including successful and not > rebalances): > >>> > > [startTime=2020-04-13 10:55:18,726, finishTime=2020-04-13 > 10:55:18,780, > >>> > > d=54 ms] Supplier statistics: [nodeId=0, p=60, e=250, b=25000, > d=54 ms] > >>> > > [nodeId=1, p=60, e=250, b=24945, d=54 ms] Aliases: p - > partitions, e - > >>> > > entries, b - bytes, d - duration, h - historical, nodeId mapping > >>> > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1] > >>> > > [1=rebalancing.RebalanceStatisticsTest0] > >>> > > Rebalance total information (including successful and not > rebalances): > >>> > > [startTime=2020-04-13 15:01:43,822, finishTime=2020-04-13 > 15:01:44,116, > >>> > > d=294 ms] Supplier statistics: [nodeId=0, hp=20, he=500, > hb=50445, > >>> d=294 > >>> > > ms] Aliases: p - partitions, e - entries, b - bytes, d - > duration, h - > >>> > > historical, nodeId mapping (nodeId=id,consistentId) > >>> > > [0=rebalancing.RebalanceStatisticsTest0] > >>> > > > >>> > > [1] - https://issues.apache.org/jira/browse/IGNITE-12080 > >> > >> -- > >> > >> Best regards, > >> Alexei Scherbakov > -- Best regards, Alexei Scherbakov