Hi Kirill,

I think it would be helpful to add a few more metrics to the "Completed rebalancing" log line

Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, rebalanceId=1]

namely:

avgSpeed=<bytesRcvd/duration>, partitionsNum=N,
histPartitionsNum=M, histBytesRcvd=X KB, histDuration=Yms, histAvgSpeed=<histBytesRcvd/histDuration>,
fullPartitionsNum=K, fullBytesRcvd=X KB, fullDuration=Yms, fullAvgSpeed=<fullBytesRcvd/fullDuration>
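The avg speed values are trivial to derive from what's already collected. Roughly (a sketch with made-up names, just to show the intent; the only catch is millisecond durations rounding to zero):

// Average speed in bytes/sec; duration is in milliseconds.
// Returns 0 for a rebalance that finished faster than the clock resolution.
static long avgSpeed(long bytesRcvd, long durationMs) {
    return durationMs > 0 ? bytesRcvd * 1000 / durationMs : 0;
}

The same helper covers avgSpeed, histAvgSpeed and fullAvgSpeed. For the line above it gives avgSpeed(6042, 8) = 755250 B/s, i.e. about 738 KB/s (5,9 KB is ~6042 bytes).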
I believe these additional metrics are good in that a user can easily set up monitoring based on them, and they give a lot of useful information about rebalance performance in general. There is no need to look for the "Starting rebalance" message - everything is in one place.

If any of these are hard to add, we can skip them (I suspect histDuration and fullDuration can't be measured in practice, but I don't know for sure). Let's add what we can.

Thanks,
Stan

> On 3 Jul 2020, at 17:21, ткаленко кирилл <tkalkir...@yandex.ru> wrote:
>
> Sorry, forgot.
>
> [1] - org.apache.ignite.internal.processors.cache.CacheGroupsMetricsRebalanceTest#testCacheGroupRebalance
>
> 03.07.2020, 17:20, "ткаленко кирилл" <tkalkir...@yandex.ru>:
>> Hi, Stan!
>>
>> I don't fully understand you yet.
>>
>> Right now you can use the metrics the way it is done in the test [1]. Or can you tell me where this should be done - for example, when rebalancing completes for all groups?
>>
>> Here is what is now available in the logs:
>>
>> 1) Which group is rebalanced and which type of rebalance is used:
>> Starting rebalance routine [grp0, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, fullPartitions=[4, 7], histPartitions=[], rebalanceId=1]
>>
>> 2) Completion of rebalancing from one of the suppliers:
>> Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, rebalanceId=1]
>>
>> 3) Completion of the entire rebalance:
>> Completed rebalance chain: [rebalanceId=1, partitions=116, entries=400, duration=41ms, bytesRcvd=40,4 KB]
>>
>> These messages share a common parameter, rebalanceId=1.
>>
>> 03.07.2020, 16:48, "Stanislav Lukyanov" <stanlukya...@gmail.com>:
>>>> On 3 Jul 2020, at 09:51, ткаленко кирилл <tkalkir...@yandex.ru> wrote:
>>>>
>>>> To calculate the average value, you can use the existing metrics "RebalancingStartTime", "RebalancingLastCancelledTime", "RebalancingEndTime", "RebalancingPartitionsLeft", "RebalancingReceivedKeys" and "RebalancingReceivedBytes".
>>>
>>> You can calculate it, and I believe that is the first thing anyone would do when reading these logs and metrics.
>>> If it's that essential, then maybe it should be available out of the box?
>>>
>>>> This also works correctly with the historical rebalance.
>>>> We can now see the rebalance type for each group and each supplier in the logs. I don't think we should duplicate this information.
>>>>
>>>> [2020-07-03 09:49:31,481][INFO ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting rebalance routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a59577100000, fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]
>>>
>>> I'm talking about adding info on how much data has been transferred during rebalance.
>>> When rebalance completes, I'd like to know how much data was transferred, whether it was historical or full, and what the average rebalance speed was.
>>>
>>> There are two reasons for having all that.
>>>
>>> First, it helps to analyze issues by searching the logs and looking for anomalies.
>>>
>>> Second, it allows automating alerts: e.g. if you know your typical historical rebalance speed, you can trigger an alert if it drops below that.
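>>>
>>> To make the second point concrete, here is roughly what a monitoring script has to do today with the existing metrics (a sketch only - the ObjectName is illustrative, since the exact domain and keys depend on the Ignite version and instance name; look the bean up with jconsole first):
>>>
>>> import java.lang.management.ManagementFactory;
>>> import javax.management.MBeanServer;
>>> import javax.management.ObjectName;
>>>
>>> public class RebalanceAlert {
>>>     // Assumed typical historical rebalance speed, bytes/sec.
>>>     static final long MIN_SPEED = 10L * 1024 * 1024;
>>>
>>>     public static void main(String[] args) throws Exception {
>>>         MBeanServer srv = ManagementFactory.getPlatformMBeanServer();
>>>
>>>         // Illustrative name of the cache group metrics bean for grp0.
>>>         ObjectName bean = new ObjectName(
>>>             "org.apache:group=\"Cache groups\",name=\"grp0\"");
>>>
>>>         long start = (Long)srv.getAttribute(bean, "RebalancingStartTime");
>>>         long end = (Long)srv.getAttribute(bean, "RebalancingEndTime");
>>>         long bytes = (Long)srv.getAttribute(bean, "RebalancingReceivedBytes");
>>>
>>>         // Guard against the "not started / not finished yet" values.
>>>         if (start > 0 && end > start) {
>>>             long avgSpeed = bytes * 1000 / (end - start);
>>>
>>>             if (avgSpeed < MIN_SPEED)
>>>                 System.err.println("Slow rebalance: " + avgSpeed + " B/s");
>>>         }
>>>     }
>>> }
>>>
>>> Having the average speed in the log line (and ideally as a metric) removes this boilerplate.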
>>>
>>>> 03.07.2020, 02:49, "Stanislav Lukyanov" <stanlukya...@gmail.com>:
>>>>> Kirill,
>>>>>
>>>>> I've looked through the patch.
>>>>> Looks good, but it feels like the first thing someone will try to do, given bytesRcvd and duration, is to divide one by the other to get an average speed.
>>>>> Do you think it's reasonable to also add it to the logs? Maybe even to the metrics?
>>>>>
>>>>> Also, this works with historical rebalance, right? Can we specify how much data was transferred via historical or full rebalance from each supplier?
>>>>> Maybe even estimate the transfer speed in entries and bytes for each rebalance type?
>>>>>
>>>>> Thanks,
>>>>> Stan
>>>>>
>>>>>> On 29 Jun 2020, at 11:50, Ivan Rakov <ivan.glu...@gmail.com> wrote:
>>>>>>
>>>>>> +1 to Alex G.
>>>>>>
>>>>>> From my experience, the most interesting cases with Ignite rebalancing happen precisely in production. Given that we already have detailed rebalancing logging, adding info about rebalance performance looks like a reasonable improvement. With the new logs we'll be able to detect and investigate situations when rebalance is slow due to uneven supplier distribution or network issues.
>>>>>> The option to disable the feature at runtime shouldn't be needed often, but it will keep us on the safe side in case something goes wrong.
>>>>>> The format described in https://issues.apache.org/jira/browse/IGNITE-12080 looks good to me.
>>>>>>
>>>>>> On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл <tkalkir...@yandex.ru> wrote:
>>>>>>
>>>>>>> Hello, Alexey!
>>>>>>>
>>>>>>> Currently there is no way to disable/enable it, but it seems the logs will not be overloaded, since Alexei Scherbakov's suggestion is reasonable and compact. Of course, we could add disabling/enabling of statistics collection via JMX, for example.
>>>>>>>
>>>>>>> 23.06.2020, 18:47, "Alexey Goncharuk" <alexey.goncha...@gmail.com>:
>>>>>>>> Hello Maxim, folks,
>>>>>>>>
>>>>>>>> Wed, 6 May 2020 at 21:01, Maxim Muzafarov <mmu...@apache.org>:
>>>>>>>>
>>>>>>>>> We won't do performance analysis on the production environment. Each time we need performance analysis, it will be done on a test environment with verbose logging enabled. Thus I suggest moving these changes to a separate `profiling` module and extending the logging much more, without any size limitations - the same as these [2] [3] activities do.
>>>>>>>>
>>>>>>>> I strongly disagree with this statement. I am not sure who is meant here by 'we', but I see strong momentum in building observability tooling that helps people understand what exactly happens in the production environment [1]. Not everybody can afford two identical environments for testing. We should make sure users have enough information to understand the root cause after an incident has happened, and not force them to reproduce it, let alone make them add another module to the classpath and restart the nodes.
>>>>>>>> I think having this functionality in the core module, with the ability to disable/enable it, is the right approach. Having the information printed to the log is OK; having it in an event that can be sent to a monitoring/tracing subsystem is even better.
>>>>>>>>
>>>>>>>> Kirill, can we enable and disable this feature at runtime, to avoid those very node restarts?
>>>>>>>>
>>>>>>>> [1] https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/