On 7/6/23 13:00, Vladislav Odintsov wrote:
> 
> 
>> On 5 Jul 2023, at 20:07, Vladislav Odintsov <odiv...@gmail.com> wrote:
>>
>> Hi Dumitru,
>>
>> thanks for the quick response!
>>
>>> On 5 Jul 2023, at 19:53, Dumitru Ceara <dce...@redhat.com> wrote:
>>>
>>> On 7/5/23 17:14, Vladislav Odintsov wrote:
>>>> Hi,
>>>>
>>>
>>> Hi Vladislav,
>>>
>>>> we’ve noticed a huge increase in ovn-controller memory consumption 
>>>> introduced with [0], compared to a version without its ovn-controller.c 
>>>> changes (an OVS submodule bump alone, without the ovn-controller changes, 
>>>> doesn’t trigger this behaviour).
>>>>
>>>
>>> Thanks for reporting this!
>>>
>>>> On an empty host connected to a working cluster, ovn-controller normally 
>>>> consumes ~175 MiB RSS, while the same host updated to a version with commit 
>>>> [0] consumes 3.3 GiB RSS. The start time and CPU consumption during process 
>>>> start are also higher for the affected version than for the normal one.
>>>>
>>>> Before upgrade (normal functioning):
>>>>
>>>> #ovn-appctl -t ovn-controller memory/show ;ps-mem -p $(pidof 
>>>> ovn-controller)| grep ovn
>>>> idl-cells-OVN_Southbound:343855 idl-cells-Open_vSwitch:14131 
>>>> ofctrl_desired_flow_usage-KB:76 ofctrl_installed_flow_usage-KB:60 
>>>> ofctrl_sb_flow_ref_usage-KB:18
>>>> 174.2 MiB + 327.5 KiB = 174.5 MiB  ovn-controller
>>>>
>>>> After upgrading to the affected version I checked the memory report while 
>>>> ovn-controller was starting, and at some point there was a much larger 
>>>> number of OVN_Southbound cells compared to the "after start" numbers:
>>>>
>>>> during start:
>>>>
>>>> # ovn-appctl -t ovn-controller memory/show ;ps-mem -p $(pidof 
>>>> ovn-controller)| grep ovn
>>>> idl-cells-OVN_Southbound:11388742 idl-cells-Open_vSwitch:14131 
>>>> idl-outstanding-txns-Open_vSwitch:1
>>>> 3.3 GiB + 327.0 KiB =   3.3 GiB    ovn-controller
>>>>
>>>> after start:
>>>>
>>>> # ovn-appctl -t ovn-controller memory/show ;ps-mem -p $(pidof 
>>>> ovn-controller)| grep ovn
>>>> idl-cells-OVN_Southbound:343896 idl-cells-Open_vSwitch:14131 
>>>> idl-outstanding-txns-Open_vSwitch:1
>>>> 3.3 GiB + 327.0 KiB =   3.3 GiB    ovn-controller
>>>>
>>>>
>>>> cells during start:
>>>> 11388742
>>>>
>>>> cells after start:
>>>> 343896
>>>>
>>>
>>> Are you running with ovn-monitor-all=true on this host?
>>
>> No, it has default false.
>>
>>>
>>> I guess it's unlikely but I'll try just in case: would it be possible to
>>> share the SB database?
>>
>> Unfortunately, no. But I can say it’s about 450 MB in size after compaction. 
>> And there are about 1M mac_bindings, if it’s important :).
> 

I tried in a sandbox, before and after the commit in question, with a
script that adds 1M mac bindings on top of the sample topology built by
tutorial/ovn-setup.sh.
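
For reference, something along these lines would do it (just a sketch; the
lr0/lr0-sw0 names and the addresses are placeholders, and in practice the
inserts should be batched into fewer transactions to keep it fast):

  dp=$(ovn-sbctl --bare --columns _uuid find datapath_binding external_ids:name=lr0)
  for i in $(seq 0 999999); do
      ovn-sbctl create mac_binding datapath="$dp" logical_port=lr0-sw0 \
          ip="10.$(( i / 65536 )).$(( i / 256 % 256 )).$(( i % 256 ))" \
          mac="$(printf '0a:00:%02x:%02x:%02x:%02x' \
                 $(( i >> 24 )) $(( i >> 16 & 255 )) $(( i >> 8 & 255 )) $(( i & 255 )))"
  done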

I see ovn-controller memory usage going to >3GB before the commit you
blamed and to >1.9GB after the same commit.  So it looks different from
what you reported, but it's still worrying that we use so much memory for
mac bindings.

I'm assuming, however, that quite a few of the 1M mac bindings in your
setup are stale, so would it be possible to enable mac_binding aging?  It
needs to be enabled per router with something like:

 $ ovn-nbctl set logical_router RTR options:mac_binding_age_threshold=60

The threshold is in seconds and is (for now) a hard limit after which a
mac binding entry is removed.  There's work in progress to only clear
ARP cache entries that are really stale [1].
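
If there are many routers, the same option can be set across the board with
something like this (just a sketch looping over the NB logical_router
table):

 $ for rtr in $(ovn-nbctl --bare --columns name list logical_router); do
       ovn-nbctl set logical_router "$rtr" options:mac_binding_age_threshold=60
   done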

> But if you are interested in any specific information, let me know it, I can 
> check.
> 

How many "local" datapaths do we have on the hosts that exhibit high
memory usage?

The quickest way I can think of to get this info is to run this on the
hypervisor:
 $ ovn-appctl ct-zone-list | grep snat -c

Additional question: how many mac_bindings are "local", i.e., associated
with local datapaths?
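
One way to get a rough breakdown is to group the SB mac_bindings by
datapath and then match those datapath UUIDs against the ones that are
local to the chassis, e.g. (a sketch):

 $ ovn-sbctl --bare --columns datapath list mac_binding \
       | grep -v '^$' | sort | uniq -c | sort -rn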

>>
>>>
>>>> I guess it could be related to this problem. Can anyone take a look at 
>>>> this and comment, please?
>>>>
>>>
>>> Does the memory usage persist after the SB is upgraded too?  I see the
>>> number of SB idl-cells eventually goes down, which means the periodic
>>> malloc_trim() call should eventually free up the memory.  We trim on idle
>>> since
>>> https://github.com/ovn-org/ovn/commit/b4c593d23bd959e98fcc9ada4a973ac933579ded
>>>
>>
>> In this upgrade the DB schemas didn’t change, so they’re already up to date.
>>
>>> Are you using a different allocator?  E.g., jemalloc.
>>
>> No, this issue reproduces with the default glibc allocator.
>>

Can we run a test and see if malloc_trim() was actually called?  I'd
first disable lflow-cache log rate limiting:

 $ ovn-appctl vlog/disable-rate-limit lflow_cache

Then check if malloc_trim() was called after the lflow-cache detected
inactivity.  You'd see logs like:

"lflow_cache|INFO|Detected cache inactivity (last active 30005 ms ago):
trimming cache"

The fact that the number of SB idl-cells goes down while the memory usage
doesn't seems to indicate we might have a bug in the automatic cache
trimming mechanism.
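
It might also help to dump the lflow cache statistics while this is
happening, with something like:

 $ ovn-appctl -t ovn-controller lflow-cache/show-stats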

In any case, a malloc_trim() can be manually triggered by flushing the
lflow cache:

 $ ovn-appctl lflow-cache/flush

Thanks,
Dumitru

>>>
>>>>
>>>> 0: 
>>>> https://github.com/ovn-org/ovn/commit/1b0dbde940706e5de6e60221be78a278361fa76d

[1] https://patchwork.ozlabs.org/project/ovn/list/?series=359894&state=*

>>>>
>>>>
>>>>
>>>> Regards,
>>>> Vladislav Odintsov
>>>>
>>>
>>> Regards,
>>> Dumitru
>>>
>>
>>
>> Regards,
>> Vladislav Odintsov
>>
> 
> 
> Regards,
> Vladislav Odintsov
> 
> 

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
