Re: [ovs-dev] OVSDB echo requests/replies in a background thread: should we?

Mark Michelson Tue, 09 Aug 2022 12:53:26 -0700

On 8/9/22 15:12, Ilya Maximets wrote:

On 8/9/22 20:59, Mark Michelson wrote:

On 8/8/22 09:31, Ilya Maximets wrote:

On 8/5/22 08:28, Han Zhou wrote:

On Thu, Jun 30, 2022 at 12:55 PM Mark Michelson <[email protected]> wrote:


Hi OVS and OVN devs,

The OVN team has considered the idea of moving inactivity probes (i.e.
OVSDB echo requests/replies) into a background thread.

OVN logical networks can be very large, meaning that ovn components such
as ovn-northd and ovn-controller may take a while to process everything
in an OVSDB database. On large clusters, we end up seeing the following
loop occur:

1. The OVN component connects to the database.
2. The OVN component must compute the entire contents of the database.
3. While the OVN component is executing its main loop, the inactivity
probe interval expires. The OVN component is disconnected from the database.
4. The OVN component finishes its computation.
5. Since the OVN component is disconnected from the database, it must
reconnect. Go to step 1.

This makes for an unstable and slow experience. Typically if OVN can get
past the initial loop after connecting to the database, then incremental
processing will allow for subsequent loops to execute much more quickly.
However, the constant disconnect-reconnect makes OVN operate at its
slowest at all times.

The way we've dealt with this before is to try to optimize the
performance of OVN components, while also advising that the inactivity
probe gets set to a high value. The problem is that as the demand for
larger and larger logical networks grows, the execution time of OVN is
hard to bring down much more, but the inactivity probes have to keep
getting higher and higher to avoid the described scenario. Once OVN
reaches its "stable" state where incremental processing makes loops
execute quickly, this high inactivity probe becomes detrimental. It
means that if there is a legitimate disconnection, then we don't detect
it very quickly.

Hi Mark,

Sorry for the late reply.
For the slowness of ovn-northd/ovn-controller, our practice is to disable
the inactivity probe on the DB server side. It seems harmless because if the
client network is recovered it would just reconnect, right?
The problem is more with the other direction: probe from client to server.
This probe is required because if a client (e.g. ovn-controller) doesn't
need to send any transactions to the server for a long time, then it would
not detect a server crash promptly, and thus would not trigger reconnecting
to another server in the cluster in time.
With the client -> server probe enabled, there is a scale problem with the
SB server if it is connected to a huge number of clients and the probe
interval is not high enough: when it is busy serving the clients, it may
fail to respond to the probes in time, causing some clients to reconnect
and re-transmit data, a cascading failure.
So the background thread approach may appear to be helpful for the
server side.
However, there is already an alternative to the server side scale problem
(I haven't tried yet), the OVSDB relay, which can reduce the number of
clients per server to a very low number. In this case, a short probe should
not matter.
So based on the above thoughts, I am not sure if it is really necessary to
have the background probe handling.

As mentioned at the top of the email, a possible solution is to put the
inactivity probes into a background thread. Is this in the spirit of the
inactivity probe? From my point of view, the inactivity probe should
fail only in a serious error condition, such as a network outage, or a
program crash. If a program is "busy" it is still "active" and should
therefore not be subject to inactivity probe failures. However, I want
to get the opinions of the list on this.


This is indeed a controversial point. If a program is "busy" but "active",
I agree it shouldn't be subject to probe failures. However, if the program
is not responsive at all, due to bugs, e.g. the main thread is in a dead
loop (although it is not very likely to happen in OVN components), should
the probe fail? I think the answer is yes in this case. Consider the
case where an SB server node is not responsive at all because of bugs while
the background thread is still responding to probes: the client wouldn't
notice the problem and would not reconnect to a healthy node, which defeats
the purpose of the probe.
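
The concern above can be shown with a minimal sketch. It is not OVSDB code: the queues stand in for the JSON-RPC connection, and the names echo_responder and run_demo are hypothetical. It only shows that a background responder keeps answering probes whether the main thread is doing useful work or is stuck.

```python
import queue
import threading
import time


def echo_responder(requests, replies, stop):
    """Background thread: answer every echo request it receives.

    It has no knowledge of what the main thread is doing.
    """
    while not stop.is_set():
        try:
            probe = requests.get(timeout=0.05)
        except queue.Empty:
            continue
        replies.put(("echo-reply", probe))


def run_demo(main_busy_s=0.3):
    """Return True if a probe was answered while the main thread was busy."""
    requests, replies = queue.Queue(), queue.Queue()
    stop = threading.Event()
    worker = threading.Thread(
        target=echo_responder, args=(requests, replies, stop), daemon=True
    )
    worker.start()

    requests.put("probe-1")
    # Stands in for a long recompute OR a dead loop: the peer cannot
    # tell the two apart, because the reply arrives either way.
    time.sleep(main_busy_s)

    answered = not replies.empty()
    stop.set()
    worker.join()
    return answered
```

In this sketch the reply is produced even though the "main loop" never runs an iteration, which is the failure mode described above: the peer sees a healthy connection to a process that may be making no progress.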


I think there is one more thing we should keep in mind - plain old TCP
keepalive functionality.  Instead of implementing some special background
threads, users can just disable all inactivity probes and LD_PRELOAD a
keepalive library with the desired configuration that will be automatically
applied to all connections, and it will be the kernel's responsibility to
handle probes.  IIUC, that should cover all the same cases as a background
thread, and will have the same drawbacks of not detecting the process being
stuck in an infinite loop.
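
For reference, this is roughly what enabling kernel keepalives on a single socket looks like when done in code rather than through a preloaded library. It is a sketch under assumptions: the helper name enable_keepalive and the 30/10/3 values are illustrative only, and TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are the Linux option names, so they are applied only where the platform exposes them.

```python
import socket


def enable_keepalive(sock, idle_s=30, interval_s=10, count=3):
    """Turn on TCP keepalive and, where available, tune its timers.

    idle_s:     seconds of idleness before the first keepalive probe
    interval_s: seconds between unanswered probes
    count:      unanswered probes before the kernel drops the connection
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tunables = (
        ("TCP_KEEPIDLE", idle_s),
        ("TCP_KEEPINTVL", interval_s),
        ("TCP_KEEPCNT", count),
    )
    for name, value in tunables:
        # These constants are platform-specific; skip any that are missing.
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

With these example values, a dead peer would be noticed after roughly idle_s + interval_s * count seconds of silence, and the probes are answered by the peer's kernel regardless of what the peer's application is doing.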

Best regards, Ilya Maximets.


A long time ago (July 2020), Anton Ivanov provided a patch that enabled 
keepalives at the kernel level. This is the latest version of the patch I could 
find: https://mail.openvswitch.org/pipermail/ovs-dev/2020-June/371262.html

I had some findings on that particular patch, and AFAIK, there was never a 
follow-up patch sent (although I may have missed it when searching).

Would something like that be a viable alternative?


The beauty of this is that you don't need to change the code!
Just install libkeepalive in your favorite distribution and
LD_PRELOAD the library while starting the application.
See some examples here:
   https://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/#libkeepalive

Best regards, Ilya Maximets.

Wow, so if I understand correctly, we could handle this entirely in
ovn-ctl and the systemd service files, and we'd get inactivity probes
that work independent of what the application is doing. That's pretty nifty.

It doesn't address Han's concern about timing out if the code is in a
dead loop or is deadlocked. But that's the sort of thing a local
watchdog could take care of instead.
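
One possible shape for such a local watchdog, as a sketch only: the main loop records a heartbeat once per iteration, and a separate checker treats a stale heartbeat as a stuck process. The names Heartbeat and is_stuck and the thresholds are hypothetical, and what to do on detection (log, restart the service, etc.) is deliberately left out.

```python
import threading
import time


class Heartbeat:
    """Timestamp updated by the main loop once per iteration."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last = time.monotonic()

    def beat(self):
        """Record that the main loop completed an iteration."""
        with self._lock:
            self._last = time.monotonic()

    def age(self):
        """Seconds since the main loop last reported progress."""
        with self._lock:
            return time.monotonic() - self._last


def is_stuck(heartbeat, max_age_s):
    """True if the main loop has not reported progress for max_age_s."""
    return heartbeat.age() > max_age_s
```

The max_age_s threshold would have to be longer than the slowest legitimate full recompute, otherwise the watchdog would reintroduce the same false positives that the inactivity probe causes today.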


_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
