Re: [ovs-dev] [PATCH] bond: Fix inaccurate log info in bond_shift_load.

Vladislav Odintsov Wed, 12 Mar 2025 11:07:16 -0700

Sorry, I hit the send button too early.

On 12.03.2025 20:49, Vladislav Odintsov wrote:
>
> On 12.03.2025 19:16, Ilya Maximets wrote:
>> On 3/12/25 16:01, Vladislav Odintsov wrote:
>>> On 12.03.2025 17:45, Ilya Maximets wrote:
>>>> On 3/12/25 12:41, Vladislav Odintsov wrote:
>>>>> On 12.03.2025 14:11, Ilya Maximets wrote:
>>>>>> On 3/12/25 11:22, Vladislav Odintsov wrote:
>>>>>>> Hi Han, Ilya and Adrian,
>>>>>>>
>>>>>>> I'm also suffering from these messages in logs. Han, do you have 
>>>>>>> plans
>>>>>>> on working with requested changes?
>>>>>>>
>>>>>>> If you've got no time at the moment, I can continue this work 
>>>>>>> addressing
>>>>>>> reviewers' comments. Ilya, you said 1kB is a small chunk of data 
>>>>>>> and we
>>>>>>> shouldn't care too much about it. Which value is okay to ignore 
>>>>>>> from
>>>>>>> your perspective?
>>>>>> Are you suffering from excessive logging or from not enough 
>>>>>> granularity
>>>>>> in the logs?  This patch increases the accuracy and doesn't 
>>>>>> reduce the
>>>>>> amount of logs.
>>>>>>
>>>>>> If the amount of logs is a problem, then maybe we should move the 
>>>>>> logs
>>>>>> for individual load shifts to DBG level (maybe keep a 
>>>>>> particularly heavy
>>>>>> ones like >1MB move at INFO) and have a separate overview of the 
>>>>>> change
>>>>>> instead showing the load distribution after the rebalancing, e.g.:
>>>>>>
>>>>>>     bond X: rebalanced (now carrying: portA:XXXkB, portB:YYYkB, 
>>>>>> portC:ZZZkB)
>>>>>>
>>>>>> For the debug logs we could improve the accuracy as this patch does.
>>>>>>
>>>>>> What do you think?
>>>>> We suffer from excessive logging of "0 kB" records. There are tons of
>>>>> them. So I planned to take your proposal and do not to log these 
>>>>> events
>>>>> at all. But I like your last idea about hiding messages of 
>>>>> re-balancing
>>>>> <1MB under DBG log level and keep INFO for others.
>>>>>
>>>>> If you guys are okay with that, I can prepare and send a patch.
>>>> I think, if we're hiding some information under DBG, then we need 
>>>> to add
>>>> a new log with an overview of the rebalance with a new load 
>>>> distribution.
>>>> With that, sounds fine to me.
>>> Did you mean that we should log total cumulative re-balanced amount 
>>> once
>>> in a timeframe (let say 5-15 minutes) or just log with some rate-limit?
>> The total cumulative amount at the end of 
>> ofproto/bond.c:bond_rebalance()
>> if rebalanced == true.  See the example log above.  So, it the log 
>> should
>> appear at most once per rebalance interval, which is 10s by default.
>>
>> The current 'shift X of load' log will remain but at DBG level, 
>> except for
>> a very large ones.
>>
>> Will that solve your problem, or is that still too much of logging (once
>> in 10s) ?
>
> 50/50 :)
>
> # grep 2025-03-12 /var/log/openvswitch/ovs-vswitchd.log | wc -l
> 47388
> # grep 2025-03-12 /var/log/openvswitch/ovs-vswitchd.log | grep shift -c
> 47339
> # grep 2025-03-12 /var/log/openvswitch/ovs-vswitchd.log | grep 'shift 
> 0kB' | wc -l
> 20384
> # grep 2025-03-12 /var/log/openvswitch/ovs-vswitchd.log | grep shift | 
> cut -d '|' -f1 | uniq -c | sort -n | sed -e 's/^[[:space:]]*//' | cut 
> -d ' ' -f1 | uniq -c | sort -n
>       1 102
>       1 108
>       1 109
>       1 110
>       1 112
>       1 116
>       1 117
>       1 118
>       1 119
>       1 120
>       1 123
>       1 124
>       1 127
>       1 132
>       1 133
>       1 135
>       1 136
>       1 139
>       1 142
>       1 151
>       1 153
>       1 155
>       1 159
>       1 161
>       1 163
>       1 164
>       1 166
>       1 169
>       1 170
>       1 171
>       1 172
>       1 173
>       1 174
>       1 175
>       1 179
>       1 182
>       1 185
>       1 186
>       1 187
>       1 188
>       1 191
>       1 193
>       1 194
>       1 196
>       1 198
>       1 202
>       1 207
>       1 209
>       1 211
>       1 212
>       1 213
>       1 217
>       1 220
>       1 222
>       1 223
>       1 224
>       1 227
>       1 228
>       1 229
>       1 230
>       1 231
>       1 233
>       1 235
>       1 236
>       1 238
>       1 242
>       1 245
>       1 25
>       1 34
>       1 39
>       1 42
>       1 47
>       1 48
>       1 50
>       1 54
>       1 56
>       1 57
>       1 62
>       1 65
>       1 66
>       1 71
>       1 74
>       1 75
>       1 78
>       1 82
>       1 84
>       1 85
>       1 86
>       1 89
>       1 92
>       1 98
>       2 103
>       2 105
>       2 114
>       2 126
>       2 128
>       2 138
>       2 157
>       2 165
>       2 167
>       2 189
>       2 190
>       2 197
>       2 199
>       2 20
>       2 200
>       2 21
>       2 210
>       2 215
>       2 225
>       2 226
>       2 239
>       2 244
>       2 247
>       2 248
>       2 249
>       2 250
>       2 26
>       2 40
>       2 46
>       2 51
>       2 52
>       2 53
>       2 55
>       2 58
>       2 59
>       2 63
>       2 67
>       2 70
>       2 81
>       2 83
>       2 88
>       2 90
>       2 91
>       2 99
>       3 115
>       3 192
>       3 204
>       3 216
>       3 240
>       3 251
>       3 252
>       3 254
>       3 27
>       3 28
>       3 29
>       3 30
>       3 32
>       3 33
>       3 41
>       3 43
>       3 44
>       3 49
>       3 68
>       3 69
>       3 76
>       3 77
>       3 96
>       3 97
>       4 17
>       4 208
>       4 24
>       4 241
>       4 35
>       4 37
>       4 60
>       4 64
>       5 152
>       5 232
>       5 243
>       5 36
>       6 19
>       6 23
>       6 38
>       8 22
>      11 16
>      12 11
>      12 15
>      13 12
>      13 13
>      13 14
>      14 18
>      19 9
>      20 8
>      25 10
>      42 7
>      70 6
>     112 5
>     205 4
>     449 3
>     741 1
>    1034 2
>
> The last command shows the amount (1st column) of shift events in one 
> microsecond. Like this:


Sorry, there was an error:

The *second* column (value) shows the total amount of shift log lines 
occurred in any one millisecond in a log by last day.
The *first* column (key) shows the total times in a log by last day 
where there was a (value) events in one millisecond.

Like this:

2025-03-12T17:49:12.307Z|2772298|bond|INFO|bond directconnect: shift 
407kB of load (with hash 40) from eth6 to eth7 (now carrying 1283kB and 
935kB load, respectively)
2025-03-12T17:49:12.307Z|2772299|bond|INFO|bond directconnect: shift 1kB 
of load (with hash 82) from eth6 to eth7 (now carrying 1282kB and 936kB 
load, respectively)
2025-03-12T17:49:12.307Z|2772300|bond|INFO|bond directconnect: shift 0kB 
of load (with hash 85) from eth6 to eth7 (now carrying 1281kB and 937kB 
load, respectively)
2025-03-12T17:49:12.307Z|2772301|bond|INFO|bond directconnect: shift 0kB 
of load (with hash 45) from eth6 to eth7 (now carrying 1280kB and 938kB 
load, respectively)
2025-03-12T17:49:12.307Z|2772302|bond|INFO|bond directconnect: shift 0kB 
of load (with hash 233) from eth6 to eth7 (now carrying 1280kB and 938kB 
load, respectively)
2025-03-12T17:49:12.307Z|2772303|bond|INFO|bond directconnect: shift 0kB 
of load (with hash 19) from eth6 to eth7 (now carrying 1279kB and 938kB 
load, respectively)
2025-03-12T17:49:12.307Z|2772304|bond|INFO|bond directconnect: shift 0kB 
of load (with hash 1) from eth6 to eth7 (now carrying 1279kB and 939kB 
load, respectively)
2025-03-12T17:49:12.307Z|2772305|bond|INFO|bond directconnect: shift 0kB 
of load (with hash 227) from eth6 to eth7 (now carrying 1279kB and 939kB 
load, respectively)
2025-03-12T17:49:12.307Z|2772306|bond|INFO|bond directconnect: shift 0kB 
of load (with hash 76) from eth6 to eth7 (now carrying 1279kB and 939kB 
load, respectively)
2025-03-12T17:49:12.307Z|2772307|bond|INFO|bond directconnect: shift 0kB 
of load (with hash 14) from eth6 to eth7 (now carrying 1279kB and 939kB 
load, respectively)

This server lightly loaded (<100mb/s of network traffic).

Additionally:

# grep 2025-03-12 /var/log/openvswitch/ovs-vswitchd.log | grep shift | 
cut -d '|' -f1 | uniq -c | sort -n
...(a lot of lines).
     243 2025-03-12T06:32:25.462Z
     243 2025-03-12T06:47:36.978Z
     243 2025-03-12T15:28:17.831Z
     244 2025-03-12T07:50:35.699Z
     244 2025-03-12T13:22:43.546Z
     245 2025-03-12T12:46:39.905Z
     247 2025-03-12T06:51:57.431Z
     247 2025-03-12T08:49:12.899Z
     248 2025-03-12T07:24:31.333Z
     248 2025-03-12T14:40:52.260Z
     249 2025-03-12T13:24:33.725Z
     249 2025-03-12T14:23:29.867Z
     250 2025-03-12T08:22:30.169Z
     250 2025-03-12T13:37:25.005Z
     251 2025-03-12T07:15:20.215Z
     251 2025-03-12T13:10:22.321Z
     251 2025-03-12T14:24:39.985Z
     252 2025-03-12T07:57:47.565Z
     252 2025-03-12T14:40:32.224Z
     252 2025-03-12T15:26:17.663Z
     254 2025-03-12T08:05:08.456Z
     254 2025-03-12T08:21:50.104Z
     254 2025-03-12T12:42:09.455Z

The 1st column here is the total shift log lines from 1 millisecond 
(shown in 2nd column).

So, from my example, if there is a maximum 1 log line per 10s, the 
logfile still will be filled with only rebalance lines, but the total 
amount will be noticeably reduced.

By the way, why do we need this information in log? What can we conclude 
from this information?

>
>
> I guess we should not log 0kb movement at all.
>
>>
>>>> Note: chinatelecom.cn mail server doesn't seem to like 
>>>> communicating with
>>>> gmail smtp, and so my replies likely do not reach Han. Let's wait 
>>>> maybe a
>>>> couple days for a reply, in case at least some of these messages 
>>>> are delivered.
>>>>
>>>>>> Best regards, Ilya Maximets.
>>>>>>
>>>>>>> On 05.08.2024 10:08, Adrián Moreno wrote:
>>>>>>>> On Fri, Aug 02, 2024 at 06:38:19PM GMT, Ilya Maximets wrote:
>>>>>>>>> On 7/2/24 16:36, Han Ding wrote:
>>>>>>>>>> When the delta is less than 1024 in bond_shift_load, it print 
>>>>>>>>>> "shift 0kB of load".
>>>>>>>>>> Like this:
>>>>>>>>>> bond dpdkbond0: shift 0kB of load (with hash 71) from nic1 to 
>>>>>>>>>> nic2 (now carrying 20650165kB and 8311662kB load, respectively)
>>>>>>>>> Hi, Han.  Thanks for the patch!
>>>>>>>>>
>>>>>>>>> I'm curious, is this information about movements that small
>>>>>>>>> practically useful?
>>>>>>>>>
>>>>>>>> I had the same thought.
>>>>>>>> I guess it should not happen very often given how the algorithm 
>>>>>>>> works but
>>>>>>>> OTOH, printing "0kB" is definitely not useful.
>>>>>>>>
>>>>>>>>>> Signed-off-by: Han Ding <hand...@chinatelecom.cn>
>>>>>>>>>> ---
>>>>>>>>>>      ofproto/bond.c | 24 +++++++++++++++++-------
>>>>>>>>>>      1 file changed, 17 insertions(+), 7 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/ofproto/bond.c b/ofproto/bond.c
>>>>>>>>>> index c31869a..5b1975d 100644
>>>>>>>>>> --- a/ofproto/bond.c
>>>>>>>>>> +++ b/ofproto/bond.c
>>>>>>>>>> @@ -1192,13 +1192,23 @@ bond_shift_load(struct bond_entry 
>>>>>>>>>> *hash, struct bond_member *to)
>>>>>>>>>>          struct bond *bond = from->bond;
>>>>>>>>>>          uint64_t delta = hash->tx_bytes;
>>>>>>>>>>
>>>>>>>>>> -    VLOG_INFO("bond %s: shift %"PRIu64"kB of load (with hash 
>>>>>>>>>> %"PRIdPTR") "
>>>>>>>>>> -              "from %s to %s (now carrying %"PRIu64"kB and "
>>>>>>>>>> -              "%"PRIu64"kB load, respectively)",
>>>>>>>>>> -              bond->name, delta / 1024, hash - bond->hash,
>>>>>>>>>> -              from->name, to->name,
>>>>>>>>>> -              (from->tx_bytes - delta) / 1024,
>>>>>>>>>> -              (to->tx_bytes + delta) / 1024);
>>>>>>>>>> +    if (delta >= 1024) {
>>>>>>>>>> +        VLOG_INFO("bond %s: shift %"PRIu64"kB of load (with 
>>>>>>>>>> hash %"PRIdPTR") "
>>>>>>>>>> +                "from %s to %s (now carrying %"PRIu64"kB and "
>>>>>>>>>> +                "%"PRIu64"kB load, respectively)",
>>>>>>>>>> +                bond->name, delta / 1024, hash - bond->hash,
>>>>>>>>>> +                from->name, to->name,
>>>>>>>>>> +                (from->tx_bytes - delta) / 1024,
>>>>>>>>>> +                (to->tx_bytes + delta) / 1024);
>>>>>>>>>> +    } else {
>>>>>>>>>> +        VLOG_INFO("bond %s: shift %"PRIu64"B of load (with 
>>>>>>>>>> hash %"PRIdPTR") "
>>>>>>>>>> +                "from %s to %s (now carrying %"PRIu64"kB and "
>>>>>>>> Apart from Ilya's comment, missing one more indentation space 
>>>>>>>> in this
>>>>>>>> line (and all others).
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>>>> + "%"PRIu64"kB load, respectively)",
>>>>>>>>>> +                bond->name, delta, hash - bond->hash,
>>>>>>>>>> +                from->name, to->name,
>>>>>>>>>> +                (from->tx_bytes - delta) / 1024,
>>>>>>>>>> +                (to->tx_bytes + delta) / 1024);
>>>>>>>>>> +    }
>>>>>>>>> I'd suggest instead of copying the whole thing, just replace "kB"
>>>>>>>>> with another %s and use a couple of ternary operators to produce
>>>>>>>>> correct value and "kB" or "B" depending on the delta.
>>>>>>>>>
>>>>>>>>> Best regards, Ilya Maximets.
>>>>>>>>> _______________________________________________
>>>>>>>>> dev mailing list
>>>>>>>>> d...@openvswitch.org
>>>>>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> dev mailing list
>>>>>>>> d...@openvswitch.org
>>>>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
-- 
Regards,
Vladislav Odintsov

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH] bond: Fix inaccurate log info in bond_shift_load.

Reply via email to