Chris Riches <[email protected]> writes:

> On 15/04/2024 14:39, Jon Kohler wrote:
>>> On Apr 11, 2024, at 9:43 AM, Chris Riches <[email protected]> wrote:
>>>
>>> On 11/04/2024 14:24, Ilya Maximets wrote:
>>>> On 4/11/24 10:59, Chris Riches wrote:
>>>>> From what we know so far, the DB was full of stale connection-tracking
>>>>> information such as the following:
>>>>>
>>>>> [...]
>>>>>
>>>>> Once the host was recovered by putting in the timeout increase,
>>>>> ovsdb-server successfully started and GCed the database down from 2.4
>>>>> *GB* to 29 *KB*. Had this happened before the host restart, we would
>>>>> have never seen this problem. But since it seems possible to end up
>>>>> booting with such a large DB, we figured a timeout increase was a
>>>>> sensible measure to take.
>>>> Uff. Sounds like ovn-controller went off the rails.
>>>>
>>>> Normally, ovsdb-server compacts the database once in 10-20 minutes,
>>>> if the database doubles the size since the previous check. If all
>>>> the transactions are that small, it would mean ovn-controller made
>>>> about 10K transactions per second in the 10-20 minutes before the
>>>> restart. That's huge.
>>>>
>>>> I wonder if this can be addressed with a better compaction strategy.
>>>> Something like forcing compaction if "the database is more than 10 MB
>>>> and increased 10x" regardless of the time.
>>> I'm not sure exactly what the test was doing when this was
>>> observed, so I don't know whether that transaction volume is within
>>> the realm of possibility or if we're looking at a failure to
>>> perform compaction on time. It would be nice to have an enhanced
>>> safety-net for DB size, as we were only a few hundred MB away from
>>> hitting filesystem space issues as well.
>>>
>>>> Normally, ovsdb-server compacts the database once in 10-20
>>>> minutes, if the database doubles the size since the previous
>>>> check.
>>> I presume you mean if it doubled in size since the previous
>>> *compaction*? If we only compact when it doubles since the last
>>> *check*, then it would be easy for it to slightly-less-than-double
>>> every 10-20 minutes and never trigger the compaction while still
>>> growing exponentially.
>>>
>>> I'm happy to discuss compaction approaches (though my expertise is
>>> very much in host service management and not OVS itself), but do
>>> you think there's merit in having this extended timeout as a
>>> backstop too?
>> FWIW, I think we should do both extending the time out and tuning up the
>> compaction, as having a situation where a service can get in an endless
>> loop if for whatever reason it takes too long is problematic. Addressing
>> the root cause (compaction, too many calls, some other bug(s) etc) is
>> good, but extending the timeout seems like an easy backstop.
>
> I agree with Jon's assessment - regardless of any action taken on
> compaction or preventing growth in the first place, we should consider
> the proposed timeout increase as a backstop against getting stuck in
> an infinite loop.
>
> Ilya (or another maintainer) - can I get an opinion on this?

From my side, it looks fine.  I don't think we ever saw the DB take this
long on startup, so we never considered that it could (maybe compaction
also occurs on graceful exits?  I don't know ovsdb that well).

> Thanks,
> Chris

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
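
To make the compaction discussion in the quoted thread concrete, here is a
minimal Python sketch.  It is not ovsdb-server's actual logic: the function
names, the 10-minute interval, and the 1 MB starting size are assumptions
made purely for illustration.  It shows (a) a trigger that combines the
"doubled in size" rule with the proposed "more than 10 MB and increased
10x, regardless of time" safety net, and (b) why the baseline for the
"doubled" test matters: if the baseline is the size at the last *check*, a
DB growing by 1.9x per check never compacts.

```python
MB = 1024 * 1024
MIN_INTERVAL_S = 10 * 60   # assumed: lower end of the "10-20 minutes" window
FORCE_MIN_BYTES = 10 * MB  # "more than 10 MB" from the proposal
FORCE_GROWTH = 10          # "increased 10x" from the proposal


def should_compact(size_now, baseline, elapsed_s):
    """Hypothetical trigger; baseline is the size after the last compaction."""
    # Proposed safety net: big and grown 10x, regardless of elapsed time.
    if size_now > FORCE_MIN_BYTES and size_now >= FORCE_GROWTH * baseline:
        return True
    # Regular rule: enough time has passed and the size has doubled.
    return elapsed_s >= MIN_INTERVAL_S and size_now >= 2 * baseline


def simulate(growth_per_check, checks, baseline_is_last_check):
    """Grow a 1 MB DB by a fixed factor per check and return its final size.

    Compaction (modelled as a reset to 1 MB) fires when the size is at
    least double the baseline.  The flag selects which baseline is used.
    """
    size = baseline = MB
    for _ in range(checks):
        size = int(size * growth_per_check)
        if size >= 2 * baseline:
            size = baseline = MB      # compacted
        elif baseline_is_last_check:
            baseline = size           # baseline creeps up with every check
    return size
```

With a growth factor of 1.9 per check, `simulate(1.9, 10, True)` never
compacts and the DB keeps growing, while `simulate(1.9, 10, False)`
compacts on every second check.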
