Discussed in personal correspondence with Stas; we decided to improve the message:
Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001,
partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB,
topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3,
rebalanceId=1]
into:
Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001,
partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, avgSpeed=5,9 KB/sec,
histPartitions=1, histEntries=30, histBytesRcvd=1 KB,
fullPartitions=1, fullEntries=30, fullBytesRcvd=3 KB,
topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3,
rebalanceId=1]
Where:
partitions=2 - total number of partitions received
entries=60 - total number of entries received
duration=8ms - time from the first demand message to the moment this
message is written to the log
bytesRcvd=5,9 KB - total number of bytes received, rendered in B, KB, MB or GB
avgSpeed=5,9 KB/sec - average speed, bytesRcvd / duration, in KB/sec (see
the sketch after this list)
histPartitions=1 - total number of partitions received in historical mode
histEntries=30 - total number of entries received in historical mode
histBytesRcvd=1 KB - total number of bytes received in historical mode,
rendered in B, KB, MB or GB
fullPartitions=1 - total number of partitions received in full mode
fullEntries=30 - total number of entries received in full mode
fullBytesRcvd=3 KB - total number of bytes received in full mode, rendered
in B, KB, MB or GB
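
A minimal worked sketch of the avgSpeed computation implied by the list
above (bytesRcvd / duration); the class name and sample values are
placeholders, not taken from the Ignite code:

    class AvgSpeedSketch {
        public static void main(String[] args) {
            long bytesRcvd = 61_440; // hypothetical: 60 KB received
            long durationMs = 8;     // hypothetical: 8 ms of rebalancing

            // Guard against division by zero for sub-millisecond rebalances.
            double avgSpeedKBSec = durationMs > 0
                ? (bytesRcvd / 1024.0) / (durationMs / 1000.0)
                : 0.0;

            System.out.printf("avgSpeed=%.1f KB/sec%n", avgSpeedKBSec);
        }
    }
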
27.07.2020, 11:50, "ткаленко кирилл" <[email protected]>:
> 03.07.2020, 17:21, "ткаленко кирилл" <[email protected]>:
>> Sorry, forgot.
>>
>> [1] -
>> org.apache.ignite.internal.processors.cache.CacheGroupsMetricsRebalanceTest#testCacheGroupRebalance
>>
>> 03.07.2020, 17:20, "ткаленко кирилл" <[email protected]>:
>>> Hi, Stan!
>>>
>>> I don't quite understand you yet.
>>>
>>> For now, you can use the metrics as is done in the test [1]. Or can you
>>> tell me where this should be done, for example when rebalancing
>>> completes for all groups?
>>>
>>> Here is what is now available and has been added to the logs:
>>> 1) Which group is being rebalanced and which rebalance type is used.
>>> Starting rebalance routine [grp0, topVer=AffinityTopologyVersion
>>> [topVer=4, minorTopVer=0], supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001,
>>> fullPartitions=[4, 7], histPartitions=[], rebalanceId=1]
>>>
>>> 2) Completion of rebalancing from one of the suppliers.
>>> Completed rebalancing [grp=grp0,
>>> supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, partitions=2, entries=60,
>>> duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4,
>>> minorTopVer=0], progress=1/3, rebalanceId=1]
>>>
>>> 3) Completion of the entire rebalance.
>>> Completed rebalance chain: [rebalanceId=1, partitions=116, entries=400,
>>> duration=41ms, bytesRcvd=40,4 KB]
>>>
>>> These messages share the common parameter rebalanceId=1.
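>>>
>>> As an illustration, a tiny hypothetical helper (not part of Ignite)
>>> showing how the three messages can be correlated in an automated log
>>> scan via the shared rebalanceId:
>>>
>>> import java.util.regex.Matcher;
>>> import java.util.regex.Pattern;
>>>
>>> class RebalanceIdExtractor {
>>>     private static final Pattern REBALANCE_ID =
>>>         Pattern.compile("rebalanceId=(\\d+)");
>>>
>>>     /** @return rebalanceId of the given log line, or -1 if absent. */
>>>     static long rebalanceId(String logLine) {
>>>         Matcher m = REBALANCE_ID.matcher(logLine);
>>>         return m.find() ? Long.parseLong(m.group(1)) : -1L;
>>>     }
>>> }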
>>>
>>> 03.07.2020, 16:48, "Stanislav Lukyanov" <[email protected]>:
>>>>> On 3 Jul 2020, at 09:51, ткаленко кирилл <[email protected]> wrote:
>>>>>
>>>>> To calculate the average value, you can use the existing metrics
>>>>> "RebalancingStartTime", "RebalancingLastCancelledTime",
>>>>> "RebalancingEndTime", "RebalancingPartitionsLeft",
>>>>> "RebalancingReceivedKeys" and "RebalancingReceivedBytes".
>>>>
>>>> You can calculate it, and I believe that is the first thing anyone
>>>> would do when reading these logs and metrics.
>>>> If it's that essential, then maybe it should be available out of the
>>>> box?
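>>>>
>>>> For illustration, a minimal sketch of that calculation, assuming the
>>>> metric values named above have already been read (e.g. over JMX); all
>>>> sample values are placeholders:
>>>>
>>>> class AvgRebalanceSpeedFromMetrics {
>>>>     public static void main(String[] args) {
>>>>         long startTime = 1_593_766_800_000L; // RebalancingStartTime, ms
>>>>         long endTime = startTime + 41;       // RebalancingEndTime, ms
>>>>         long receivedBytes = 41_370L;        // RebalancingReceivedBytes
>>>>
>>>>         long durationMs = endTime - startTime;
>>>>
>>>>         // Average speed over the whole rebalance, in KB/sec.
>>>>         double avgKBSec = durationMs > 0
>>>>             ? (receivedBytes / 1024.0) / (durationMs / 1000.0)
>>>>             : 0.0;
>>>>
>>>>         System.out.printf("avg speed: %.1f KB/sec%n", avgKBSec);
>>>>     }
>>>> }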
>>>>
>>>>> This also works correctly with historical rebalance.
>>>>> Now we can see the rebalance type for each group and each supplier in
>>>>> the logs. I don't think we should duplicate this information.
>>>>>
>>>>> [2020-07-03 09:49:31,481][INFO
>>>>> ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting
>>>>> rebalance routine [ignite-sys-cache, topVer=AffinityTopologyVersion
>>>>> [topVer=3, minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a59577100000,
>>>>> fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]
>>>>
>>>> I'm talking about adding info on how much data has been transferred
>>>> during rebalance.
>>>> When rebalance completes, I'd like to know how much data was
>>>> transferred, whether it was historical or full, and what the average
>>>> rebalance speed was.
>>>>
>>>> There are two reasons for having all that.
>>>>
>>>> First, it helps to analyze issues by searching the logs and looking
>>>> for anomalies.
>>>>
>>>> Second, it allows automating alerts: e.g. if you know your typical
>>>> historical rebalance speed, you can trigger an alert when it drops
>>>> below that.
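>>>>
>>>> As a sketch of such an alert (the typical speed, threshold factor and
>>>> reporting hook are all assumptions, not an existing Ignite API):
>>>>
>>>> class HistRebalanceSpeedAlert {
>>>>     // Typical historical rebalance speed for this cluster, KB/sec
>>>>     // (assumed baseline).
>>>>     static final double TYPICAL_KB_SEC = 500.0;
>>>>
>>>>     // Alert when speed drops below half of typical (assumed factor).
>>>>     static final double ALERT_FACTOR = 0.5;
>>>>
>>>>     static void check(double observedKBSec) {
>>>>         if (observedKBSec < TYPICAL_KB_SEC * ALERT_FACTOR)
>>>>             System.err.println("ALERT: historical rebalance speed " +
>>>>                 "dropped to " + observedKBSec + " KB/sec");
>>>>     }
>>>> }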
>>>>
>>>>> 03.07.2020, 02:49, "Stanislav Lukyanov" <[email protected]>:
>>>>>> Kirill,
>>>>>>
>>>>>> I've looked through the patch.
>>>>>> Looks good, but it feels like the first thing someone will try to do
>>>>>> given bytesRcvd and duration is to divide one by another to get an
>>>>>> average speed.
>>>>>> Do you think it's reasonable to also add it to the logs? Maybe even
>>>>>> to the metrics?
>>>>>>
>>>>>> Also, this works with historical rebalance, right? Can we specify how
>>>>>> much data was transferred via historical or full rebalance from each
>>>>>> supplier?
>>>>>> Maybe even estimate transfer speed in entries and bytes for each
>>>>>> rebalance type?
>>>>>>
>>>>>> Thanks,
>>>>>> Stan
>>>>>>
>>>>>>> On 29 Jun 2020, at 11:50, Ivan Rakov <[email protected]> wrote:
>>>>>>>
>>>>>>> +1 to Alex G.
>>>>>>>
>>>>>>> From my experience, the most interesting cases with Ignite
>>>>>>> rebalancing happen exactly in production. Given that we already
>>>>>>> have detailed rebalancing logging, adding info about rebalance
>>>>>>> performance looks like a reasonable improvement. With the new logs
>>>>>>> we'll be able to detect and investigate situations when rebalance
>>>>>>> is slow due to uneven supplier distribution or network issues.
>>>>>>> An option to disable the feature at runtime shouldn't be needed
>>>>>>> often, but it will keep us on the safe side in case something goes
>>>>>>> wrong.
>>>>>>> The format described in
>>>>>>> https://issues.apache.org/jira/browse/IGNITE-12080 looks
>>>>>>> good to me.
>>>>>>>
>>>>>>> On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл
>>>>>>> <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello, Alexey!
>>>>>>>>
>>>>>>>> Currently there is no way to disable/enable it, but it seems that
>>>>>>>> the logs will not be overloaded, since Alexei Scherbakov's proposal
>>>>>>>> is reasonable and compact. Of course, we could add
>>>>>>>> disabling/enabling of statistics collection via JMX, for example
>>>>>>>> (see the sketch below).
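>>>>>>>>
>>>>>>>> For example (a hypothetical MBean sketch, not an existing Ignite
>>>>>>>> interface), such a runtime toggle could look like:
>>>>>>>>
>>>>>>>> public interface RebalanceStatisticsMXBean {
>>>>>>>>     /** @return Whether rebalance statistics are collected. */
>>>>>>>>     boolean isStatisticsEnabled();
>>>>>>>>
>>>>>>>>     /** Enables or disables statistics collection at runtime. */
>>>>>>>>     void setStatisticsEnabled(boolean enabled);
>>>>>>>> }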
>>>>>>>>
>>>>>>>> 23.06.2020, 18:47, "Alexey Goncharuk" <[email protected]>:
>>>>>>>>> Hello Maxim, folks,
>>>>>>>>>
>>>>>>>>> Wed, 6 May 2020 at 21:01, Maxim Muzafarov <[email protected]>:
>>>>>>>>>
>>>>>>>>>> We won't do performance analysis on the production environment.
>>>>>>>>>> Each time we need performance analysis it will be done on a test
>>>>>>>>>> environment with verbose logging enabled. Thus I suggest moving
>>>>>>>>>> these changes to a separate `profiling` module and extending the
>>>>>>>>>> logging much further without any size limitations. The same as
>>>>>>>>>> these [2] [3] activities do.
>>>>>>>>>
>>>>>>>>> I strongly disagree with this statement. I am not sure who is
>>>>>>>>> meant here by 'we', but I see strong momentum in increasing
>>>>>>>>> observability tooling that helps people understand what exactly
>>>>>>>>> happens in the production environment [1]. Not everybody can
>>>>>>>>> afford two identical environments for testing. We should make
>>>>>>>>> sure users have enough information to understand the root cause
>>>>>>>>> after an incident happens, and not force them to reproduce it,
>>>>>>>>> let alone make them add another module to the classpath and
>>>>>>>>> restart the nodes.
>>>>>>>>> I think having this functionality in the core module with the
>>>>>>>>> ability to disable/enable it is the right approach. Having the
>>>>>>>>> information printed to the log is OK; having it in an event that
>>>>>>>>> can be sent to a monitoring/tracing subsystem is even better.
>>>>>>>>>
>>>>>>>>> Kirill, can we enable and disable this feature at runtime to
>>>>>>>>> avoid restarting those very same nodes?
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/