Chris Riches <[email protected]> writes:

> On 15/04/2024 14:39, Jon Kohler wrote:
>>> On Apr 11, 2024, at 9:43 AM, Chris Riches <[email protected]> wrote:
>>>
>>> On 11/04/2024 14:24, Ilya Maximets wrote:
>>>> On 4/11/24 10:59, Chris Riches wrote:
>>>>>  From what we know so far, the DB was full of stale connection-tracking
>>>>> information such as the following:
>>>>>
>>>>> [...]
>>>>>
>>>>> Once the host was recovered by putting in the timeout increase,
>>>>> ovsdb-server successfully started and GCed the database down from 2.4
>>>>> *GB* to 29 *KB*. Had this happened before the host restart, we would
>>>>> have never seen this problem. But since it seems possible to end up
>>>>> booting with such a large DB, we figured a timeout increase was a
>>>>> sensible measure to take.
>>>> Uff.  Sounds like ovn-controller went off the rails.
>>>>
>>>> Normally, ovsdb-server compacts the database once in 10-20 minutes,
>>>> if the database doubles the size since the previous check.  If all
>>>> the transactions are that small, it would mean ovn-controller made
>>>> about 10K transactions per second in the 10-20 minutes before the
>>>> restart.  That's huge.
>>>>
>>>> I wonder if this can be addressed with a better compaction strategy.
>>>> Something like forcing compaction if "the database is more than 10 MB
>>>> and increased 10x" regardless of the time.
>>> I'm not sure exactly what the test was doing when this was
>>> observed, so I don't know whether that transaction volume is within
>>> the realm of possibility or if we're looking at a failure to
>>> perform compaction on time. It would be nice to have an enhanced
>>> safety-net for DB size, as we were only a few hundred MB away from
>>> hitting filesystem space issues as well.
>>>
>>>> Normally, ovsdb-server compacts the database once in 10-20
>>>> minutes, if the database doubles the size since the previous
>>>> check.
>>> I presume you mean if it doubled in size since the previous
>>> *compaction*? If we only compact when it doubles since the last
>>> *check*, then it would be easy for it to slightly-less-than-double
>>> every 10-20 minutes and never trigger the compaction while still
>>> growing exponentially.
>>>
>>> I'm happy to discuss compaction approaches (though my expertise is
>>> very much in host service management and not OVS itself), but do
>>> you think there's merit in having this extended timeout as a
>>> backstop too?
>> FWIW, I think we should do both extending the time out and tuning up the
>> compaction, as having a situation where a service can get in an endless
>> loop if for whatever reason it takes too long is problematic. Addressing
>> the root cause (compaction, too many calls, some other bug(s) etc) is
>> good, but extending the timeout seems like an easy backstop.
>
> I agree with Jon's assessment - regardless of any action taken on
> compaction or preventing growth in the first place, we should consider
> the proposed timeout increase as a backstop against getting stuck in
> an infinite loop.
>
> Ilya (or another maintainer) - can I get an opinion on this?

From my side, it looks fine.  I don't think we ever saw the DB take
this long on startup, so we never considered that it could.  (Maybe
compaction also occurs on graceful exits?  I don't know ovsdb that
well.)
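
To make the compaction heuristics discussed in the quoted text concrete,
here is a minimal sketch in Python.  It is not ovsdb-server's actual
code: the function name, its parameters, and the constants are
hypothetical, taken only from the figures mentioned in this thread (a
10-minute minimum interval, doubling in size, and Ilya's suggested
"more than 10 MB and increased 10x" override).  It also assumes growth
is measured against the size after the last *compaction*, which is the
interpretation Chris asked about.

```python
# Hypothetical constants, taken from the numbers quoted in this thread.
MIN_INTERVAL_S = 10 * 60             # lower bound of "once in 10-20 minutes"
REGULAR_GROWTH = 2                   # "if the database doubles the size"
FORCE_MIN_BYTES = 10 * 1024 * 1024   # suggested "more than 10 MB"
FORCE_GROWTH = 10                    # suggested "increased 10x"


def should_compact(size_now, size_after_last_compaction,
                   seconds_since_last_compaction):
    """Decide whether to compact (illustrative only, not OVS code).

    Growth is measured against the size right after the last compaction,
    so a database that slightly-less-than-doubles between checks still
    triggers compaction eventually.
    """
    base = max(size_after_last_compaction, 1)  # avoid division by zero
    growth = size_now / base

    # Proposed safety net: large absolute size plus large growth forces
    # compaction regardless of how recently the last one ran.
    if size_now > FORCE_MIN_BYTES and growth >= FORCE_GROWTH:
        return True

    # Regular path: enough time has passed and the size has doubled.
    return (seconds_since_last_compaction >= MIN_INTERVAL_S
            and growth >= REGULAR_GROWTH)
```

With these assumed thresholds, a database that grew from 29 KB to
2.4 GB would be compacted immediately by the safety-net branch, even if
the last compaction ran only seconds earlier.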

> Thanks,
> Chris
