On 4/23/24 13:10, Ilya Maximets wrote:
> On 4/23/24 12:35, Simon Horman wrote:
>> On Thu, Apr 18, 2024 at 03:35:06PM +0100, Chris Riches wrote:
>>> On 15/04/2024 14:39, Jon Kohler wrote:
>>>>> On Apr 11, 2024, at 9:43 AM, Chris Riches <[email protected]> wrote:
>>>>>
>>>>> On 11/04/2024 14:24, Ilya Maximets wrote:
>>>>>> On 4/11/24 10:59, Chris Riches wrote:
>>>>>>> From what we know so far, the DB was full of stale connection-tracking
>>>>>>> information such as the following:
>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>> Once the host was recovered by putting in the timeout increase,
>>>>>>> ovsdb-server successfully started and GCed the database down from 2.4
>>>>>>> *GB* to 29 *KB*. Had this happened before the host restart, we would
>>>>>>> have never seen this problem. But since it seems possible to end up
>>>>>>> booting with such a large DB, we figured a timeout increase was a
>>>>>>> sensible measure to take.
>>>>>> Uff.  Sounds like ovn-controller went off the rails.
>>>>>>
>>>>>> Normally, ovsdb-server compacts the database once in 10-20 minutes,
>>>>>> if the database doubles the size since the previous check.  If all
>>>>>> the transactions are that small, it would mean ovn-controller made
>>>>>> about 10K transactions per second in the 10-20 minutes before the
>>>>>> restart.  That's huge.
>>>>>>
>>>>>> I wonder if this can be addressed with a better compaction strategy.
>>>>>> Something like forcing compaction if "the database is more than 10 MB
>>>>>> and increased 10x" regardless of the time.
>>>>> I'm not sure exactly what the test was doing when this was observed, so I 
>>>>> don't know whether that transaction volume is within the realm of 
>>>>> possibility or if we're looking at a failure to perform compaction on 
>>>>> time. It would be nice to have an enhanced safety-net for DB size, as we 
>>>>> were only a few hundred MB away from hitting filesystem space issues as 
>>>>> well.
>>>>>
>>>>>> Normally, ovsdb-server compacts the database once in 10-20 minutes, if 
>>>>>> the database doubles the size since the previous check.
>>>>> I presume you mean if it doubled in size since the previous *compaction*? 
>>>>> If we only compact when it doubles since the last *check*, then it would 
>>>>> be easy for it to slightly-less-than-double every 10-20 minutes and never 
>>>>> trigger the compaction while still growing exponentially.
>>>>>
>>>>> I'm happy to discuss compaction approaches (though my expertise is very 
>>>>> much in host service management and not OVS itself), but do you think 
>>>>> there's merit in having this extended timeout as a backstop too?
>>>> FWIW, I think we should do both: extend the timeout and tune the
>>>> compaction. A service that can get stuck in an endless loop whenever
>>>> startup takes too long, for whatever reason, is problematic. Addressing
>>>> the root cause (compaction, too many calls, some other bug(s), etc.) is
>>>> good, but extending the timeout seems like an easy backstop.
>>>
>>> I agree with Jon's assessment - regardless of any action taken on compaction
>>> or preventing growth in the first place, we should consider the proposed
>>> timeout increase as a backstop against getting stuck in an infinite loop.
>>>
>>> Ilya (or another maintainer) - can I get an opinion on this?
>>
>> Yes, I agree that the timeout increase is a good idea.
>>
>> Acked-by: Simon Horman <[email protected]>
>>
> 
> Sorry for the delay, I was off for a week.  I agree that the timeout
> increase makes sense, since we know the mechanism behind the issue.
> 
> I plan to catch up on the rest of the thread and apply the fix later today.

Applied to main and backported down to 2.17.  Thanks!

Best regards, Ilya Maximets.
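
For readers following the compaction discussion quoted above, here is a
minimal sketch of the two triggers that were mentioned: the regular
"doubled, and at least 10 minutes have passed" rule and the suggested
safety net of "more than 10 MB and grown 10x, regardless of time". This
is not ovsdb-server code. The class, function names, and exact
thresholds are hypothetical, and measuring growth against the size after
the last compaction is an assumption (it is what Chris presumes; the
thread does not confirm it). The second function illustrates Chris's
concern about measuring against the previous check instead.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass
class CompactionPolicy:
    """Hypothetical policy; names and thresholds are illustrative only."""
    min_interval_s: float = 10 * 60     # lower end of "once in 10-20 minutes"
    growth_factor: float = 2.0          # "doubles the size"
    force_min_size: int = 10 * MB       # "more than 10 MB"
    force_growth_factor: float = 10.0   # "increased 10x"

    def should_compact(self, size_now, size_after_last_compaction,
                       seconds_since_last_compaction):
        # Assumption: growth is measured against the size right after the
        # last compaction, not against the size seen at the last check.
        growth = size_now / max(size_after_last_compaction, 1)
        # Suggested safety net: large and grown 10x, timer is ignored.
        if size_now > self.force_min_size and growth >= self.force_growth_factor:
            return True
        # Regular rule: timer elapsed and the database at least doubled.
        return (seconds_since_last_compaction >= self.min_interval_s
                and growth >= self.growth_factor)


def growth_vs_previous_check(factor_per_check, checks):
    """Simulate a rule that compares against the previous *check*.

    Returns (triggered, total_growth), where total_growth is the size
    relative to the starting size when the loop ends.
    """
    size = 1.0
    prev_check = size
    for _ in range(checks):
        size *= factor_per_check
        if size >= 2 * prev_check:
            return True, size
        prev_check = size   # baseline moves even though nothing was compacted
    return False, size


if __name__ == "__main__":
    policy = CompactionPolicy()
    # A 2.4 GB file that was 29 KB after the last compaction, 1 minute ago.
    print(policy.should_compact(2400 * MB, 29 * 1024, 60))
    # Growing 1.9x between checks never trips a "doubled since last check" rule.
    print(growth_vs_previous_check(1.9, 10))
```

With a baseline that moves at every check, steady growth of just under
2x per interval is never caught, while a baseline fixed at the last
compaction catches it on the second interval at the latest.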

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev