Hi OVS and OVN devs,

The OVN team has considered the idea of moving inactivity probes (i.e.
OVSDB echo requests/replies) into a background thread.

OVN logical networks can be very large, meaning that OVN components such
as ovn-northd and ovn-controller may take a while to process everything
in an OVSDB database. On large clusters, we end up seeing the following
loop:

1. The OVN component connects to the database.
2. The OVN component must process the entire contents of the database.
3. While the OVN component is executing its main loop, the inactivity
   probe interval expires. The OVN component is disconnected from the
   database.
4. The OVN component finishes its computation.
5. Since the OVN component is disconnected from the database, it must
   reconnect. Go to step 1.
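
To make the loop above concrete, here is a toy model in Python. It is only a sketch under a simplifying assumption: the peer drops the connection whenever the component stays busy for longer than the probe interval. The function name and parameters are hypothetical and do not correspond to anything in OVS or OVN.

```python
def simulate_reconnect_loop(full_recompute_s, probe_interval_s, max_attempts=5):
    """Model steps 1-5 of the loop for a component that always recomputes fully.

    Assumption (simplified): the connection is dropped if the component is
    busy for longer than probe_interval_s.

    Returns (attempt_that_stayed_connected or None, seconds_of_repeated_work).
    """
    repeated_work_s = 0.0
    for attempt in range(1, max_attempts + 1):
        # Steps 1-2: connect, then process the whole database from scratch.
        if full_recompute_s <= probe_interval_s:
            # The probe did not expire, so the connection survives and
            # later iterations can use incremental processing.
            return attempt, repeated_work_s
        # Steps 3-5: the probe expired mid-computation, so the work just
        # done has to be repeated after reconnecting.
        repeated_work_s += full_recompute_s
    return None, repeated_work_s
```

With a 30 second recompute and a 5 second probe interval, the model never stabilizes within the attempt limit. With a 3 second recompute, it stays connected on the first attempt.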

This makes for an unstable and slow experience. Typically, if OVN can get
past the initial loop after connecting to the database, then incremental
processing allows subsequent loops to execute much more quickly.
However, the constant disconnect-reconnect cycle forces OVN to operate at
its slowest at all times.

The way we've dealt with this before is to optimize the performance of
OVN components, while also advising that the inactivity probe interval
be set to a high value. The problem is that as the demand for larger and
larger logical networks grows, the execution time of OVN is hard to
bring down much further, yet the inactivity probe interval has to keep
getting higher and higher to avoid the scenario described above. Once OVN
reaches its "stable" state, where incremental processing makes loops
execute quickly, this high inactivity probe interval becomes detrimental:
if there is a legitimate disconnection, we don't detect it very quickly.
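
The tradeoff in the previous paragraph can be put in numbers with a small sketch. It assumes, as a simplification, that a connection survives the initial full recompute only if the probe interval is at least as long as the recompute, and that a dead peer goes unnoticed for roughly one probe interval. The real timing depends on the implementation. The function and field names are hypothetical.

```python
def compare_probe_intervals(full_recompute_s, candidate_intervals_s):
    """Report both sides of the tradeoff for each candidate probe interval.

    Assumptions (simplified): the connection survives the initial full
    recompute only if interval >= full_recompute_s, and a real disconnection
    goes unnoticed for about one interval.
    """
    rows = []
    for interval in candidate_intervals_s:
        rows.append({
            "probe_interval_s": interval,
            "survives_full_recompute": interval >= full_recompute_s,
            "approx_detection_delay_s": interval,
        })
    return rows
```

For a 60 second full recompute, a 5 second interval detects failures quickly but never survives the initial recompute in this model. A 120 second interval survives, but takes about two minutes to notice a real outage.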

As mentioned at the top of the email, a possible solution is to put the
inactivity probes into a background thread. Is this in the spirit of the
inactivity probe? From my point of view, the inactivity probe should
fail only in a serious error condition, such as a network outage or a
program crash. If a program is "busy", it is still "active" and should
therefore not be subject to inactivity probe failures. However, I want
to get the opinions of the list on this.
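
As a rough illustration of the idea (not a proposal for how it would be implemented in OVS, which is written in C), the Python sketch below answers echo requests from a background thread while the main thread is busy. The in-memory queues stand in for the database connection, and every name here is hypothetical. The message shapes follow the JSON-RPC "echo" method that OVSDB uses.

```python
import json
import queue
import threading
import time


def echo_responder(inbox, outbox, stop):
    """Answer echo requests until `stop` is set (runs in a background thread)."""
    while not stop.is_set():
        try:
            raw = inbox.get(timeout=0.05)
        except queue.Empty:
            continue
        msg = json.loads(raw)
        if msg.get("method") == "echo":
            # An echo reply returns the request's params under the same id.
            reply = {"result": msg["params"], "error": None, "id": msg["id"]}
            outbox.put(json.dumps(reply))


def run_demo(busy_s=0.3):
    """Send one echo while the 'main loop' is busy; return the reply or None."""
    to_client, to_server = queue.Queue(), queue.Queue()
    stop = threading.Event()
    worker = threading.Thread(target=echo_responder,
                              args=(to_client, to_server, stop), daemon=True)
    worker.start()
    # The "server" probes the client...
    to_client.put(json.dumps({"method": "echo", "params": [], "id": "probe-1"}))
    # ...while the client's main thread is stuck in a long computation.
    time.sleep(busy_s)
    stop.set()
    worker.join()
    try:
        return json.loads(to_server.get_nowait())
    except queue.Empty:
        return None
```

Calling run_demo() returns the echo reply even though the main thread did nothing but sleep. That is the behaviour in question: a busy component still counts as active.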
Thanks,
Mark Michelson
_______________________________________________
dev mailing list
https://mail.openvswitch.org/mailman/listinfo/ovs-dev