On 8/22/22 12:21, Dumitru Ceara wrote:
> Hi Ilya,
> 
> On 8/22/22 11:19, Ilya Maximets wrote:
>> daemon_started_recently() concept is flawed in terms that it uses
>> fixed number of iterations for a countdown and a fixed timeout, so
>> the undesired removal of configured resources can still happen under
>> certain conditions even with this mechanism in place.
>>
>> The root cause of the original problem is that conditional monitoring
>> in ovn-controller works in stages.  ovn-controller doesn't know all
>> the information it needs to monitor until the first monitor reply
>> arrives, then it adjusts conditions and receives all the data.  That
>> doesn't work well with the ovsdb_idl_has_ever_connected() check that
>> will return 'true' right after the first monitor reply.
>> This leads to situation where resources can be removed after the
>> first monitor reply and re-added back after condition change.
>>
>> Instead of waiting a fixed number of iterations/seconds, the correct
>> solution should be to always start with unconditional monitoring.
>> When the full initial snapshot is received, switch to conditional
>> monitoring as data is complete.  This way ovsdb_idl_has_ever_connected()
>> check will provide an accurate answer to the question: if we have
>> a sufficient amount of data to make decisions about resource removal?
> 
> This does sound like the cleanest solution, I agree.  But does it create
> an issue in large scale deployments where conditional monitoring is
> used?  The full initial SB snapshot is likely quite large.

It can be large, yes, though it shouldn't be critical.  We will not hold
all this memory for long.  If it is critical, we can fall back on the
"Alternative solution" below.  It should not be hard to implement.
I'll wait for the discussion here to conclude before working on that.

> 
>>
>> Once switched to conditional monitoring, southbound database will
>> remove all the unnecessary data from the ovn-controller, so it will
>> not need to keep it for long.  And since setup is using conditional
>> monitoring, it's likely not that big anyway to be concerned about
>> a brief spike in memory consumption on startup.
>>
> 
> For OpenShift and RedHat OpenStack this stands, we use
> ovn-monitor-all=true for now.  But I'm not sure about other large scale
> deployments.  I'll let Han comment more on this too.
> 
>> This change will also remove unnecessary waiting on startup when all
>> the data is already available.
>>
>> The main change is in the ovsdb_idl_has_ever_connected() check in the
>> update_sb_monitors(), the rest is mostly a revert of commits that
>> introduced the startup delay.
>>
>> Test enhanced to run with and w/o conditional monitoring, since the
>> issue is closely related it it.
>>
> 
> Good idea.
> 
>> Alternative solution might be to store the first condition change
>> sequence number and replace all ovsdb_idl_has_ever_connected() calls
>> with a function that also checks for the current sequence number being
>> higher than condition change one, but that will require slightly more
>> code.  Can be implemented in the future, if the first unconditional
>> monitor reply will become a problem some day.
>>
>> Fixes: 2b27eb3482a3 ("patch.c: Avoid patch interface deletion & recreation 
>> during restart.")
> 
> It seems to me we need to do something similar for OpenFlow clearing
> too.  Right now we wait a configured amount of time:
> 
> https://github.com/ovn-org/ovn/commit/896adfd2d

Hmm, yes.  It seems like with the correct handling of Sb updates in place
we should be able to just implement an "ideal" solution:

"""
  Ideally, ovn-controller should clear the flows after it
  completes the new flows computing.
"""

Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to