> On 3 Jul 2020, at 09:51, ткаленко кирилл <tkalkir...@yandex.ru> wrote:
> 
> To calculate the average value, you can use the existing metrics 
> "RebalancingStartTime", "RebalancingLastCancelledTime", "RebalancingEndTime", 
> "RebalancingPartitionsLeft", "RebalancingReceivedKeys" and 
> "RebalancingReceivedBytes".

You can calculate it, and I believe that this is the first thing anyone would 
do when reading these logs and metrics.
If that's an essential thing then maybe it should be available out of the box?
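To make the point concrete, the "first thing anyone would do" is roughly this. A sketch only, not Ignite code: plain long values stand in for the actual JMX sampling of the metrics named above.

```java
// Sketch: average rebalance speed from the existing metric values.
// Metric names (start/end time, received bytes) match those listed above;
// the JMX lookup itself is omitted and plain longs are assumed.
public class RebalanceSpeed {
    /** Average speed in bytes/sec, or 0 for an empty interval. */
    static double avgBytesPerSec(long startTimeMs, long endTimeMs, long receivedBytes) {
        long durationMs = endTimeMs - startTimeMs;
        return durationMs > 0 ? receivedBytes * 1000.0 / durationMs : 0;
    }

    public static void main(String[] args) {
        // Example: 512 MiB received over a 40-second rebalance.
        long bytes = 512L * 1024 * 1024;
        double speed = avgBytesPerSec(0L, 40_000L, bytes);
        System.out.println((long) speed + " bytes/sec"); // 13421772 bytes/sec
    }
}
```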

> 
> This also works correctly with the historical rebalance.
> Now we can see rebalance type for each group and for each supplier in logs. I 
> don't think we should duplicate this information.
> 
> [2020-07-03 09:49:31,481][INFO 
> ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting rebalance 
> routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3, 
> minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a59577100000, 
> fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]

I'm talking about adding info on how much data has been transferred during
rebalance.
When a rebalance completes I'd like to know how much data was transferred,
whether it was historical or full, and what the average rebalance speed was.

There are two reasons for having all that.

First, it helps to analyze issues by searching the logs and looking for
anomalies.

Second, it makes it possible to automate alerts: e.g. if you know your typical
historical rebalance speed, you can trigger an alert when it drops below that.
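As an illustration of the second point, the alert rule could be as simple as the following. The names and the threshold are hypothetical, not an existing Ignite API:

```java
// Hypothetical alert rule: fire when the observed historical rebalance
// speed drops below a fraction of the typical baseline speed.
// ALERT_FRACTION and both parameters are illustrative, not Ignite API.
public class RebalanceAlert {
    static final double ALERT_FRACTION = 0.5; // alert below 50% of typical

    static boolean shouldAlert(double observedBytesPerSec, double typicalBytesPerSec) {
        return observedBytesPerSec < typicalBytesPerSec * ALERT_FRACTION;
    }

    public static void main(String[] args) {
        System.out.println(shouldAlert(4.0e6, 10.0e6)); // true:  4 MB/s < 50% of 10 MB/s
        System.out.println(shouldAlert(8.0e6, 10.0e6)); // false: 8 MB/s is within range
    }
}
```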

> 
> 03.07.2020, 02:49, "Stanislav Lukyanov" <stanlukya...@gmail.com>:
>> Kirill,
>> 
>> I've looked through the patch.
>> Looks good, but it feels like the first thing someone will try to do given 
>> bytesRcvd and duration is to divide one by another to get an average speed.
>> Do you think it's reasonable to also add it to the logs? Maybe even to the 
>> metrics?
>> 
>> Also, this works with historical rebalance, right? Can we specify how much 
>> data was transferred via historical or full rebalance from each supplier?
>> Maybe even estimate transfer speed in entries and bytes for each rebalance 
>> type?
>> 
>> Thanks,
>> Stan
>> 
>>>  On 29 Jun 2020, at 11:50, Ivan Rakov <ivan.glu...@gmail.com> wrote:
>>> 
>>>  +1 to Alex G.
>>> 
>>>  From my experience, the most interesting cases with Ignite rebalancing
>>>  happen exactly in production. Given that we already have detailed
>>>  rebalancing logging, adding info about rebalance performance looks like a
>>>  reasonable improvement. With the new logs we'll be able to detect and
>>>  investigate situations when rebalance is slow due to uneven supplier
>>>  distribution or network issues.
>>>  Option to disable the feature in runtime shouldn't be used often, but it
>>>  will keep us on the safe side in case something goes wrong.
>>>  The format described in
>>>  https://issues.apache.org/jira/browse/IGNITE-12080 looks
>>>  good to me.
>>> 
>>>  On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл <tkalkir...@yandex.ru>
>>>  wrote:
>>> 
>>>>  Hello, Alexey!
>>>> 
>>>>  Currently there is no way to enable/disable it, but it seems that the
>>>>  logs will not be overloaded, since Alexei Scherbakov's suggestion seems
>>>>  reasonable and compact. Of course, we could add enabling/disabling of
>>>>  statistics collection via JMX, for example.
>>>> 
>>>>  23.06.2020, 18:47, "Alexey Goncharuk" <alexey.goncha...@gmail.com>:
>>>>>  Hello Maxim, folks,
>>>>> 
>>>>>  Wed, 6 May 2020 at 21:01, Maxim Muzafarov <mmu...@apache.org>:
>>>>> 
>>>>>>  We won't do performance analysis on the production environment. Each
>>>>>>  time we need performance analysis it will be done on a test
>>>>>>  environment with verbose logging enabled. Thus I suggest moving these
>>>>>>  changes to a separate `profiling` module and extend the logging much
>>>>>>  more without any size limitations. The same as these [2] [3]
>>>>>>  activities do.
>>>>> 
>>>>>  I strongly disagree with this statement. I am not sure who is meant here
>>>>>  by 'we', but I see strong momentum in increasing observability tooling
>>>>>  that helps people to understand what exactly happens in the production
>>>>>  environment [1]. Not everybody can afford two identical environments for
>>>>>  testing. We should make sure users have enough information to understand
>>>>>  the root cause after the incident happened, and not force them to
>>>>>  reproduce it, let alone make them add another module to the classpath
>>>>>  and restart the nodes.
>>>>>  I think having this functionality in the core module with the ability to
>>>>>  disable/enable it is the right approach. Having the information printed
>>>>>  to the log is ok; having it in an event that can be sent to a
>>>>>  monitoring/tracing subsystem is even better.
>>>>> 
>>>>>  Kirill, can we enable and disable this feature at runtime to avoid
>>>>>  restarting those very same nodes?
>>>>> 
>>>>>  [1]
>>>>>  https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/
