Discussed in personal correspondence with Stas; we decided to improve the message:
Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001,
partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB,
topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3,
rebalanceId=1]
into:
Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001,
partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, avgSpeed=5,9 KB/sec,
histPartitions=1, histEntries=30, histBytesRcvd=1 KB,
fullPartitions=1, fullEntries=30, fullBytesRcvd=3 KB,
topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3,
rebalanceId=1]
Where:
partitions=2 - total number of partitions received
entries=60 - total number of entries received
duration=8ms - time from the first demand message to the moment this
message is written to the log
bytesRcvd=5,9 KB - total number of bytes received, rendered in B, KB, MB or GB
avgSpeed=5,9 KB/sec - average speed, bytesRcvd / duration, in KB/sec (see
the sketch after this list)
histPartitions=1 - total number of partitions received in historical mode
histEntries=30 - total number of entries received in historical mode
histBytesRcvd=1 KB - total number of bytes received in historical mode,
rendered in B, KB, MB or GB
fullPartitions=1 - total number of partitions received in full mode
fullEntries=30 - total number of entries received in full mode
fullBytesRcvd=3 KB - total number of bytes received in full mode, rendered
in B, KB, MB or GB
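
A minimal worked sketch of the avgSpeed computation implied by the list
above (bytesRcvd / duration); the class name and sample values are
placeholders, not taken from the Ignite code:

    class AvgSpeedSketch {
        public static void main(String[] args) {
            long bytesRcvd = 61_440; // hypothetical: 60 KB received
            long durationMs = 8;     // hypothetical: 8 ms of rebalancing

            // Guard against division by zero for sub-millisecond rebalances.
            double avgSpeedKBSec = durationMs > 0
                ? (bytesRcvd / 1024.0) / (durationMs / 1000.0)
                : 0.0;

            System.out.printf("avgSpeed=%.1f KB/sec%n", avgSpeedKBSec);
        }
    }
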
27.07.2020, 11:50, "ткаленко кирилл" <[email protected]>:
> 03.07.2020, 17:21, "ткаленко кирилл" <[email protected]>:
>> Sorry, forgot.
>>
>> [1] -
>> org.apache.ignite.internal.processors.cache.CacheGroupsMetricsRebalanceTest#testCacheGroupRebalance
>>
>> 03.07.2020, 17:20, "ткаленко кирилл" <[email protected]>:
>>> Hi, Stan!
>>>
>>> I don't quite understand you yet.
>>>
>>> For now, you can use the metrics as is done in the test [1]. Or can you
>>> tell me where this should be done, for example when rebalancing
>>> completes for all groups?
>>>
>>> Here is what is now available and has been added to the logs:
>>> 1) Which group is being rebalanced and which rebalance type is used.
>>> Starting rebalance routine [grp0, topVer=AffinityTopologyVersion
>>> [topVer=4, minorTopVer=0], supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001,
>>> fullPartitions=[4, 7], histPartitions=[], rebalanceId=1]
>>>
>>> 2) Completion of rebalancing from one of the suppliers.
>>> Completed rebalancing [grp=grp0,
>>> supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, partitions=2, entries=60,
>>> duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4,
>>> minorTopVer=0], progress=1/3, rebalanceId=1]
>>>
>>> 3) Completion of the entire rebalance.
>>> Completed rebalance chain: [rebalanceId=1, partitions=116, entries=400,
>>> duration=41ms, bytesRcvd=40,4 KB]
>>>
>>> These messages share the common parameter rebalanceId=1.
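>>>
>>> As an illustration, a tiny hypothetical helper (not part of Ignite)
>>> showing how the three messages can be correlated in an automated log
>>> scan via the shared rebalanceId:
>>>
>>> import java.util.regex.Matcher;
>>> import java.util.regex.Pattern;
>>>
>>> class RebalanceIdExtractor {
>>>     private static final Pattern REBALANCE_ID =
>>>         Pattern.compile("rebalanceId=(\\d+)");
>>>
>>>     /** @return rebalanceId of the given log line, or -1 if absent. */
>>>     static long rebalanceId(String logLine) {
>>>         Matcher m = REBALANCE_ID.matcher(logLine);
>>>         return m.find() ? Long.parseLong(m.group(1)) : -1L;
>>>     }
>>> }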
>>>
>>> 03.07.2020, 16:48, "Stanislav Lukyanov" <[email protected]>:
>>>>> On 3 Jul 2020, at 09:51, ткаленко кирилл <[email protected]> wrote:
>>>>>
>>>>> To calculate the average value, you can use the existing metrics
>>>>> "RebalancingStartTime", "RebalancingLastCancelledTime",
>>>>> "RebalancingEndTime", "RebalancingPartitionsLeft",
>>>>> "RebalancingReceivedKeys" and "RebalancingReceivedBytes".
>>>>
>>>> You can calculate it, and I believe that is the first thing anyone
>>>> would do when reading these logs and metrics.
>>>> If it's that essential, then maybe it should be available out of the
>>>> box?
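>>>>
>>>> For illustration, a minimal sketch of that calculation, assuming the
>>>> metric values named above have already been read (e.g. over JMX); all
>>>> sample values are placeholders:
>>>>
>>>> class AvgRebalanceSpeedFromMetrics {
>>>>     public static void main(String[] args) {
>>>>         long startTime = 1_593_766_800_000L; // RebalancingStartTime, ms
>>>>         long endTime = startTime + 41;       // RebalancingEndTime, ms
>>>>         long receivedBytes = 41_370L;        // RebalancingReceivedBytes
>>>>
>>>>         long durationMs = endTime - startTime;
>>>>
>>>>         // Average speed over the whole rebalance, in KB/sec.
>>>>         double avgKBSec = durationMs > 0
>>>>             ? (receivedBytes / 1024.0) / (durationMs / 1000.0)
>>>>             : 0.0;
>>>>
>>>>         System.out.printf("avg speed: %.1f KB/sec%n", avgKBSec);
>>>>     }
>>>> }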
>>>>
>>>>> This also works correctly with historical rebalance.
>>>>> Now we can see the rebalance type for each group and each supplier in
>>>>> the logs. I don't think we should duplicate this information.
>>>>>
>>>>> [2020-07-03 09:49:31,481][INFO
>>>>> ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting
>>>>> rebalance routine [ignite-sys-cache, topVer=AffinityTopologyVersion
>>>>> [topVer=3, minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a59577100000,
>>>>> fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]
>>>>
>>>> I'm talking about adding info on how much data has been transferred
>>>> during rebalance.
>>>> When rebalance completes, I'd like to know how much data was
>>>> transferred, whether it was historical or full, and what the average
>>>> rebalance speed was.
>>>>
>>>> There are two reasons for having all that.
>>>>
>>>> First, it helps to analyze issues by searching the logs and looking
>>>> for anomalies.
>>>>
>>>> Second, it allows automating alerts: e.g. if you know your typical
>>>> historical rebalance speed, you can trigger an alert when it drops
>>>> below that.
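>>>>
>>>> As a sketch of such an alert (the typical speed, threshold factor and
>>>> reporting hook are all assumptions, not an existing Ignite API):
>>>>
>>>> class HistRebalanceSpeedAlert {
>>>>     // Typical historical rebalance speed for this cluster, KB/sec
>>>>     // (assumed baseline).
>>>>     static final double TYPICAL_KB_SEC = 500.0;
>>>>
>>>>     // Alert when speed drops below half of typical (assumed factor).
>>>>     static final double ALERT_FACTOR = 0.5;
>>>>
>>>>     static void check(double observedKBSec) {
>>>>         if (observedKBSec < TYPICAL_KB_SEC * ALERT_FACTOR)
>>>>             System.err.println("ALERT: historical rebalance speed " +
>>>>                 "dropped to " + observedKBSec + " KB/sec");
>>>>     }
>>>> }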
>>>>
>>>>> 03.07.2020, 02:49, "Stanislav Lukyanov" <[email protected]>:
>>>>>> Kirill,
>>>>>>
>>>>>> I've looked through the patch.
>>>>>> Looks good, but it feels like the first thing someone will try to do
>>>>>> given bytesRcvd and duration is to divide one by another to get an
>>>>>> average speed.
>>>>>> Do you think it's reasonable to also add it to the logs? Maybe even
>>>>>> to the metrics?
>>>>>>
>>>>>> Also, this works with historical rebalance, right? Can we specify how
>>>>>> much data was transferred via historical or full rebalance from each
>>>>>> supplier?
>>>>>> Maybe even estimate transfer speed in entries and bytes for each
>>>>>> rebalance type?
>>>>>>
>>>>>> Thanks,
>>>>>> Stan
>>>>>>
>>>>>>> On 29 Jun 2020, at 11:50, Ivan Rakov <[email protected]> wrote:
>>>>>>>
>>>>>>> +1 to Alex G.
>>>>>>>
>>>>>>> From my experience, the most interesting cases with Ignite
>>>>>>> rebalancing happen exactly in production. Given that we already
>>>>>>> have detailed rebalancing logging, adding info about rebalance
>>>>>>> performance looks like a reasonable improvement. With the new logs
>>>>>>> we'll be able to detect and investigate situations when rebalance
>>>>>>> is slow due to uneven supplier distribution or network issues.
>>>>>>> An option to disable the feature at runtime shouldn't be needed
>>>>>>> often, but it will keep us on the safe side in case something goes
>>>>>>> wrong.
>>>>>>> The format described in
>>>>>>> https://issues.apache.org/jira/browse/IGNITE-12080 looks
>>>>>>> good to me.
>>>>>>>
>>>>>>> On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл
>>>>>>> <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello, Alexey!
>>>>>>>>
>>>>>>>> Currently there is no way to disable/enable it, but it seems that
>>>>>>>> the logs will not be overloaded, since Alexei Scherbakov's proposal
>>>>>>>> is reasonable and compact. Of course, we could add
>>>>>>>> disabling/enabling of statistics collection via JMX, for example
>>>>>>>> (see the sketch below).
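>>>>>>>>
>>>>>>>> For example (a hypothetical MBean sketch, not an existing Ignite
>>>>>>>> interface), such a runtime toggle could look like:
>>>>>>>>
>>>>>>>> public interface RebalanceStatisticsMXBean {
>>>>>>>>     /** @return Whether rebalance statistics are collected. */
>>>>>>>>     boolean isStatisticsEnabled();
>>>>>>>>
>>>>>>>>     /** Enables or disables statistics collection at runtime. */
>>>>>>>>     void setStatisticsEnabled(boolean enabled);
>>>>>>>> }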
>>>>>>>>
>>>>>>>> 23.06.2020, 18:47, "Alexey Goncharuk" <[email protected]>:
>>>>>>>>> Hello Maxim, folks,
>>>>>>>>>
>>>>>>>>> Wed, 6 May 2020 at 21:01, Maxim Muzafarov <[email protected]>:
>>>>>>>>>
>>>>>>>>>> We won't do performance analysis on the production environment.
>>>>>>>>>> Each time we need performance analysis it will be done on a test
>>>>>>>>>> environment with verbose logging enabled. Thus I suggest moving
>>>>>>>>>> these changes to a separate `profiling` module and extending the
>>>>>>>>>> logging much further without any size limitations. The same as
>>>>>>>>>> these [2] [3] activities do.
>>>>>>>>>
>>>>>>>>> I strongly disagree with this statement. I am not sure who is
>>>>>>>>> meant here by 'we', but I see strong momentum in increasing
>>>>>>>>> observability tooling that helps people understand what exactly
>>>>>>>>> happens in the production environment [1]. Not everybody can
>>>>>>>>> afford two identical environments for testing. We should make
>>>>>>>>> sure users have enough information to understand the root cause
>>>>>>>>> after an incident happens, and not force them to reproduce it,
>>>>>>>>> let alone make them add another module to the classpath and
>>>>>>>>> restart the nodes.
>>>>>>>>> I think having this functionality in the core module with the
>>>>>>>>> ability to disable/enable it is the right approach. Having the
>>>>>>>>> information printed to the log is OK; having it in an event that
>>>>>>>>> can be sent to a monitoring/tracing subsystem is even better.
>>>>>>>>>
>>>>>>>>> Kirill, can we enable and disable this feature at runtime to
>>>>>>>>> avoid restarting those very same nodes?
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/