Discussed in personal correspondence with Stas; we decided to improve the message:

Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, rebalanceId=1]
into:

Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, avgSpeed=5,9 KB/sec, histPartitions=1, histEntries=30, histBytesRcvd=1 KB, fullPartitions=1, fullEntries=30, fullBytesRcvd=3 KB, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, rebalanceId=1]

Where:
partitions=2 - total number of partitions received
entries=60 - total number of entries received
duration=8ms - time from the first demand message to the moment this message is printed to the log
bytesRcvd=5,9 KB - total number of bytes received, formatted as B, KB, MB or GB

avgSpeed = bytesRcvd / duration, in KB/sec

histPartitions=1 - total number of partitions received in historical mode
histEntries=30 - total number of entries received in historical mode
histBytesRcvd=1 KB - total number of bytes received in historical mode, formatted as B, KB, MB or GB

fullPartitions=1 - total number of partitions received in full mode
fullEntries=30 - total number of entries received in full mode
fullBytesRcvd=3 KB - total number of bytes received in full mode, formatted as B, KB, MB or GB
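For illustration, a minimal sketch of the avgSpeed derivation (the class, names and values below are placeholders for this thread, not code from the patch):

    class AvgSpeedSketch {
        public static void main(String[] args) {
            long bytesRcvd = 6_041; // placeholder: total bytes received from one supplier
            long durationMs = 8;    // placeholder: rebalancing duration for that supplier

            // avgSpeed = bytesRcvd / duration, reported in KB/sec;
            // guard against a zero duration for very fast rebalances.
            double avgSpeedKbSec = durationMs > 0
                ? (bytesRcvd / 1024.0) / (durationMs / 1000.0)
                : 0;

            System.out.printf("avgSpeed=%.1f KB/sec%n", avgSpeedKbSec);
        }
    }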
27.07.2020, 11:50, "ткаленко кирилл" <tkalkir...@yandex.ru>:
> 03.07.2020, 17:21, "ткаленко кирилл" <tkalkir...@yandex.ru>:
>> Sorry, forgot:
>>
>> [1] - org.apache.ignite.internal.processors.cache.CacheGroupsMetricsRebalanceTest#testCacheGroupRebalance
>>
>> 03.07.2020, 17:20, "ткаленко кирилл" <tkalkir...@yandex.ru>:
>>> Hi, Stan!
>>>
>>> I don't quite understand you yet.
>>>
>>> For now, you can use the metrics as is done in the test [1]. Or can you tell me where else this should be done, for example when rebalancing completes for all groups?
>>>
>>> Here is what is now available and has been added to the logs:
>>> 1) Which group is rebalanced and which type of rebalance is used:
>>> Starting rebalance routine [grp0, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, fullPartitions=[4, 7], histPartitions=[], rebalanceId=1]
>>>
>>> 2) Completion of rebalancing from one of the suppliers:
>>> Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, rebalanceId=1]
>>>
>>> 3) Completion of the entire rebalance:
>>> Completed rebalance chain: [rebalanceId=1, partitions=116, entries=400, duration=41ms, bytesRcvd=40,4 KB]
>>>
>>> These messages share the common parameter rebalanceId=1.
>>>
>>> 03.07.2020, 16:48, "Stanislav Lukyanov" <stanlukya...@gmail.com>:
>>>>> On 3 Jul 2020, at 09:51, ткаленко кирилл <tkalkir...@yandex.ru> wrote:
>>>>>
>>>>> To calculate the average value, you can use the existing metrics "RebalancingStartTime", "RebalancingLastCancelledTime", "RebalancingEndTime", "RebalancingPartitionsLeft", "RebalancingReceivedKeys" and "RebalancingReceivedBytes".
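(For illustration, a rough sketch of that calculation, assuming the metric values named above have already been fetched, e.g. over JMX or a metrics exporter; the helper below is hypothetical, not Ignite API:)

    class RebalanceSpeedFromMetrics {
        /**
         * Average rebalance speed in KB/sec, derived from the
         * RebalancingStartTime, RebalancingEndTime and
         * RebalancingReceivedBytes values quoted above.
         */
        static double avgSpeedKbSec(long startTimeMs, long endTimeMs, long receivedBytes) {
            long durationMs = endTimeMs - startTimeMs;
            return durationMs > 0
                ? (receivedBytes / 1024.0) / (durationMs / 1000.0)
                : 0;
        }
    }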
>>>> You can calculate it, and I believe that is the first thing anyone would do when reading these logs and metrics.
>>>> If that's such an essential thing, then maybe it should be available out of the box?
>>>>
>>>>> This also works correctly with historical rebalance.
>>>>> Now we can see the rebalance type for each group and for each supplier in the logs. I don't think we should duplicate this information.
>>>>>
>>>>> [2020-07-03 09:49:31,481][INFO ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting rebalance routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a59577100000, fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]
>>>>
>>>> I'm talking about adding info on how much data has been transferred during rebalance.
>>>> When rebalance completes, I'd like to know how much data was transferred, whether it was historical or full, and what the average rebalance speed was.
>>>>
>>>> There are two reasons for having all that.
>>>>
>>>> First, it helps to analyze issues by searching the logs and looking for anomalies.
>>>>
>>>> Second, it makes it possible to automate alerts: e.g. if you know your typical historical rebalance speed, you can trigger an alert if it drops below that.
>>>>
>>>>> 03.07.2020, 02:49, "Stanislav Lukyanov" <stanlukya...@gmail.com>:
>>>>>> Kirill,
>>>>>>
>>>>>> I've looked through the patch.
>>>>>> Looks good, but it feels like the first thing someone will try to do given bytesRcvd and duration is to divide one by the other to get an average speed.
>>>>>> Do you think it's reasonable to also add it to the logs? Maybe even to the metrics?
>>>>>>
>>>>>> Also, this works with historical rebalance, right? Can we specify how much data was transferred via historical or full rebalance from each supplier?
>>>>>> Maybe even estimate the transfer speed in entries and bytes for each rebalance type?
>>>>>>
>>>>>> Thanks,
>>>>>> Stan
>>>>>>
>>>>>>> On 29 Jun 2020, at 11:50, Ivan Rakov <ivan.glu...@gmail.com> wrote:
>>>>>>>
>>>>>>> +1 to Alex G.
>>>>>>>
>>>>>>> From my experience, the most interesting cases with Ignite rebalancing happen exactly in production. Given that we already have detailed rebalancing logging, adding info about rebalance performance looks like a reasonable improvement. With the new logs we'll be able to detect and investigate situations when rebalance is slow due to uneven supplier distribution or network issues.
>>>>>>> The option to disable the feature at runtime shouldn't be needed often, but it will keep us on the safe side in case something goes wrong.
>>>>>>> The format described in https://issues.apache.org/jira/browse/IGNITE-12080 looks good to me.
>>>>>>>
>>>>>>> On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл <tkalkir...@yandex.ru> wrote:
>>>>>>>
>>>>>>>> Hello, Alexey!
>>>>>>>>
>>>>>>>> Currently there is no way to disable/enable it, but it seems that the logs will not be overloaded, since Alexei Scherbakov's proposal seems reasonable and compact. Of course, disabling/enabling statistics collection could be added, via JMX for example.
>>>>>>>>
>>>>>>>> 23.06.2020, 18:47, "Alexey Goncharuk" <alexey.goncha...@gmail.com>:
>>>>>>>>> Hello Maxim, folks,
>>>>>>>>>
>>>>>>>>> Wed, 6 May 2020 at 21:01, Maxim Muzafarov <mmu...@apache.org>:
>>>>>>>>>
>>>>>>>>>> We won't do performance analysis on the production environment. Each time we need performance analysis, it will be done on a test environment with verbose logging enabled. Thus I suggest moving these changes to a separate `profiling` module and extending the logging much more without any size limitations, the same as these [2] [3] activities do.
>>>>>>>>>
>>>>>>>>> I strongly disagree with this statement. I am not sure who is meant here by 'we', but I see strong momentum in increasing observability tooling that helps people understand what exactly happens in the production environment [1]. Not everybody can afford two identical environments for testing. We should make sure users have enough information to understand the root cause after an incident has happened, and not force them to reproduce it, let alone make them add another module to the classpath and restart the nodes.
>>>>>>>>> I think having this functionality in the core module with the ability to disable/enable it is the right approach. Having the information printed to the log is OK; having it in an event that can be sent to a monitoring/tracing subsystem is even better.
>>>>>>>>>
>>>>>>>>> Kirill, can we enable and disable this feature at runtime, to avoid those very node restarts?
>>>>>>>>>
>>>>>>>>> [1] https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/
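For reference, a minimal sketch of the kind of runtime JMX toggle discussed above (all names are hypothetical, not from the actual patch):

    // RebalanceStatisticsToggleMBean.java (hypothetical)
    public interface RebalanceStatisticsToggleMBean {
        boolean isEnabled();
        void setEnabled(boolean enabled);
    }

    // RebalanceStatisticsToggle.java (hypothetical)
    import java.lang.management.ManagementFactory;
    import javax.management.ObjectName;

    public class RebalanceStatisticsToggle implements RebalanceStatisticsToggleMBean {
        // Checked by the statistics-printing code; flipped over JMX at runtime,
        // so no node restart is required.
        private volatile boolean enabled = true;

        @Override public boolean isEnabled() { return enabled; }
        @Override public void setEnabled(boolean enabled) { this.enabled = enabled; }

        public static void main(String[] args) throws Exception {
            // Register the flag so any JMX console can change it at runtime.
            ManagementFactory.getPlatformMBeanServer().registerMBean(
                new RebalanceStatisticsToggle(),
                new ObjectName("org.example:type=RebalanceStatisticsToggle"));
        }
    }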