On 4/23/24 13:10, Ilya Maximets wrote:
> On 4/23/24 12:35, Simon Horman wrote:
>> On Thu, Apr 18, 2024 at 03:35:06PM +0100, Chris Riches wrote:
>>> On 15/04/2024 14:39, Jon Kohler wrote:
>>>>> On Apr 11, 2024, at 9:43 AM, Chris Riches <[email protected]>
>>>>> wrote:
>>>>>
>>>>> On 11/04/2024 14:24, Ilya Maximets wrote:
>>>>>> On 4/11/24 10:59, Chris Riches wrote:
>>>>>>> From what we know so far, the DB was full of stale connection-tracking
>>>>>>> information such as the following:
>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>> Once the host was recovered by putting in the timeout increase,
>>>>>>> ovsdb-server successfully started and GCed the database down from 2.4
>>>>>>> *GB* to 29 *KB*. Had this happened before the host restart, we would
>>>>>>> have never seen this problem. But since it seems possible to end up
>>>>>>> booting with such a large DB, we figured a timeout increase was a
>>>>>>> sensible measure to take.
>>>>>> Uff. Sounds like ovn-controller went off the rails.
>>>>>>
>>>>>> Normally, ovsdb-server compacts the database once in 10-20 minutes,
>>>>>> if the database doubles the size since the previous check. If all
>>>>>> the transactions are that small, it would mean ovn-controller made
>>>>>> about 10K transactions per second in the 10-20 minutes before the
>>>>>> restart. That's huge.
>>>>>>
>>>>>> I wonder if this can be addressed with a better compaction strategy.
>>>>>> Something like forcing compaction if "the database is more than 10 MB
>>>>>> and increased 10x" regardless of the time.
>>>>> I'm not sure exactly what the test was doing when this was observed, so I
>>>>> don't know whether that transaction volume is within the realm of
>>>>> possibility or if we're looking at a failure to perform compaction on
>>>>> time. It would be nice to have an enhanced safety-net for DB size, as we
>>>>> were only a few hundred MB away from hitting filesystem space issues as
>>>>> well.
>>>>>
>>>>>> Normally, ovsdb-server compacts the database once in 10-20 minutes, if
>>>>>> the database doubles the size since the previous check.
>>>>> I presume you mean if it doubled in size since the previous *compaction*?
>>>>> If we only compact when it doubles since the last *check*, then it would
>>>>> be easy for it to slightly-less-than-double every 10-20 minutes and never
>>>>> trigger the compaction while still growing exponentially.
>>>>>
>>>>> I'm happy to discuss compaction approaches (though my expertise is very
>>>>> much in host service management and not OVS itself), but do you think
>>>>> there's merit in having this extended timeout as a backstop too?
>>>> FWIW, I think we should do both extending the time out and tuning up the
>>>> compaction, as having a situation where a service can get in an endless
>>>> loop if for whatever reason it takes too long is problematic. Addressing
>>>> the root cause (compaction, too many calls, some other bug(s) etc) is
>>>> good, but extending the timeout seems like an easy backstop.
>>>
>>> I agree with Jon's assessment - regardless of any action taken on compaction
>>> or preventing growth in the first place, we should consider the proposed
>>> timeout increase as a backstop against getting stuck in an infinite loop.
>>>
>>> Ilya (or another maintainer) - can I get an opinion on this?
>>
>> Yes, I agree that the timeout increase is a good idea.
>>
>> Acked-by: Simon Horman <[email protected]>
>>
>
> Sorry for delay, been off for a week. I agree that timeout increase
> makes sense since we know the mechanism for the occurrence of the issue.
>
> I plan to catch up on the rest of the thread and apply the fix later today.

Applied to main and backported down to 2.17. Thanks!

Best regards, Ilya Maximets.
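
For anyone reading the compaction discussion quoted in this thread, here is a
rough sketch of the two points raised there: the difference between measuring
growth since the last *check* versus the last *compaction*, and the suggested
"more than 10 MB and increased 10x" safety net. This is not taken from
ovsdb-server; the function names (`doubled`, `force_compact`, `simulate`), the
1 MB starting size and the fixed per-check growth factor are all assumptions
made purely for illustration.

```python
MB = 1024 * 1024


def doubled(size_now: int, baseline: int) -> bool:
    """True once the file is at least twice the baseline size."""
    return size_now >= 2 * baseline


def force_compact(size_now: int, size_after_compaction: int) -> bool:
    """Hypothetical safety net from the thread: the file is larger than
    10 MB and has grown at least 10x, regardless of elapsed time."""
    return size_now > 10 * MB and size_now >= 10 * size_after_compaction


def simulate(growth_per_check: float, checks: int, start: int = 1 * MB):
    """Grow the file by a fixed factor between periodic checks.

    Returns (first_hit_vs_last_check, first_hit_vs_last_compaction):
    the index of the first check at which each baseline would trigger
    a compaction, or None if it never triggers within `checks` checks.
    """
    size = start
    last_check = start          # baseline if we compare against the last check
    hit_check = None
    hit_compaction = None
    for i in range(1, checks + 1):
        size = int(size * growth_per_check)
        if hit_check is None and doubled(size, last_check):
            hit_check = i
        # `start` stands in for the size right after the last compaction.
        if hit_compaction is None and doubled(size, start):
            hit_compaction = i
        last_check = size
    return hit_check, hit_compaction


if __name__ == "__main__":
    # Growing 1.9x per check: never "doubles since the last check",
    # but doubles relative to the post-compaction size by the second check.
    print(simulate(1.9, 10))
    # Sizes reported in the thread: 2.4 GB on disk, 29 KB after compaction.
    print(force_compact(int(2.4 * 1024 * MB), 29 * 1024))
```

With a check-relative baseline, a file that grows slightly less than 2x per
interval is never compacted, which is the concern Chris raised. The size-based
rule would fire for the 2.4 GB / 29 KB case from the report no matter when the
last check ran.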
