On 15/04/2024 14:39, Jon Kohler wrote:
On Apr 11, 2024, at 9:43 AM, Chris Riches <[email protected]> wrote:

On 11/04/2024 14:24, Ilya Maximets wrote:
On 4/11/24 10:59, Chris Riches wrote:
From what we know so far, the DB was full of stale connection-tracking
information such as the following:

[...]

Once the host was recovered by putting in the timeout increase,
ovsdb-server successfully started and GCed the database down from 2.4
*GB* to 29 *KB*. Had this happened before the host restart, we would
have never seen this problem. But since it seems possible to end up
booting with such a large DB, we figured a timeout increase was a
sensible measure to take.
Uff.  Sounds like ovn-controller went off the rails.

Normally, ovsdb-server compacts the database once in 10-20 minutes,
if the database doubles the size since the previous check.  If all
the transactions are that small, it would mean ovn-controller made
about 10K transactions per second in the 10-20 minutes before the
restart.  That's huge.
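
A rough back-of-the-envelope check of that estimate can be sketched as follows. It assumes the whole 2.4 GB accumulated within a single 10-20 minute window and that "GB" means decimal gigabytes; neither is confirmed in the thread, and the function name is purely illustrative.

```python
# Back-of-the-envelope check. The size and rate come from the thread; the
# idea that everything was written in one window is an assumption.
DB_BYTES = 2.4e9        # on-disk size before compaction (decimal GB assumed)
TXN_PER_SEC = 10_000    # transaction rate estimated in the thread


def implied_txn_bytes(window_minutes: float) -> float:
    """Average bytes per transaction needed to reach DB_BYTES in the window."""
    return DB_BYTES / (TXN_PER_SEC * window_minutes * 60)


for minutes in (10, 20):
    size = implied_txn_bytes(minutes)
    print(f"{minutes} min window: ~{size:.0f} bytes per transaction")
```

Under those assumptions the estimate corresponds to roughly 200-400 bytes per transaction, so the 10K transactions per second figure only holds if the individual records are about that small.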

I wonder if this can be addressed with a better compaction strategy.
Something like forcing compaction if "the database is more than 10 MB
and increased 10x" regardless of the time.
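
The suggested rule could look something like the sketch below. This is not ovsdb-server's actual code: `DbState`, `should_compact`, and the 600-second minimum interval are hypothetical names and values chosen for illustration, and only the "more than 10 MB and increased 10x" condition comes from the message above.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass
class DbState:
    """Hypothetical snapshot of the values a compaction check might look at."""
    size_bytes: int                   # current on-disk size
    baseline_bytes: int               # size recorded at the reference point
    seconds_since_compaction: float   # time since the last compaction


def should_compact(db: DbState, min_interval_s: float = 600.0) -> bool:
    grew_2x = db.size_bytes >= 2 * db.baseline_bytes
    grew_10x = db.size_bytes >= 10 * db.baseline_bytes

    # Time-gated rule as described in the thread: enough time has passed
    # and the database has doubled. The 600 s default is an assumption.
    timed = db.seconds_since_compaction >= min_interval_s and grew_2x

    # Proposed safety net: large and grown 10x, regardless of elapsed time.
    forced = db.size_bytes > 10 * MB and grew_10x

    return timed or forced


# A database that ballooned from 29 KB to 300 MB within 30 seconds.
print(should_compact(DbState(300 * MB, 29 * 1024, 30.0)))
```

With the forced branch in place, a runaway writer triggers compaction as soon as both size conditions are met instead of waiting out the interval.
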
I'm not sure exactly what the test was doing when this was observed, so I don't 
know whether that transaction volume is within the realm of possibility or if 
we're looking at a failure to perform compaction on time. It would be nice to 
have an enhanced safety-net for DB size, as we were only a few hundred MB away 
from hitting filesystem space issues as well.

Normally, ovsdb-server compacts the database once in 10-20 minutes, if the 
database doubles the size since the previous check.
I presume you mean if it doubled in size since the previous *compaction*? If we 
only compact when it doubles since the last *check*, then it would be easy for 
it to slightly-less-than-double every 10-20 minutes and never trigger the 
compaction while still growing exponentially.
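
The difference between the two baselines can be illustrated with a small simulation. This is a toy model, not ovsdb-server's logic: the function name, the 1.9x growth per check, and the assumption that compaction returns the database to its starting size are all made up for the example.

```python
def simulate(baseline_mode: str, checks: int = 10,
             growth: float = 1.9, start: float = 1.0) -> list[int]:
    """Return the check numbers at which compaction would trigger.

    baseline_mode is "check" (compare against the size at the previous
    check) or "compaction" (compare against the size at the last compaction).
    """
    size = start
    baseline = start
    compactions = []
    for check in range(1, checks + 1):
        size *= growth                  # grows just under 2x between checks
        if size >= 2 * baseline:
            compactions.append(check)
            size = start                # assumption: compaction resets the size
            baseline = size
        elif baseline_mode == "check":
            baseline = size             # baseline follows every check
    return compactions


print("baseline = last check:     ", simulate("check"))
print("baseline = last compaction:", simulate("compaction"))
```

In this model the "last check" baseline never triggers a compaction, while the database grows by a factor of about 613 (1.9^10) over ten checks. The "last compaction" baseline triggers at every second check.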

I'm happy to discuss compaction approaches (though my expertise is very much in 
host service management and not OVS itself), but do you think there's merit in 
having this extended timeout as a backstop too?
FWIW, I think we should do both: extend the timeout and tune the
compaction. A service that can get stuck in an endless loop whenever
startup takes too long, for whatever reason, is problematic. Addressing
the root cause (compaction, too many calls, some other bug(s), etc.) is
good, but extending the timeout seems like an easy backstop.

I agree with Jon's assessment - regardless of any action taken on compaction or preventing growth in the first place, we should consider the proposed timeout increase as a backstop against getting stuck in an infinite loop.

Ilya (or another maintainer) - can I get an opinion on this?


Thanks,
Chris
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
