Hi Numan,

in our experience, in that case the connection can never be fully established, and northd ends up in an endless loop at 100% CPU utilization.

Numan Siddique <[email protected]> wrote on Sat, 18 Sep 2021, 00:05:

> On Fri, Sep 17, 2021 at 5:02 PM Renat Nurgaliyev <[email protected]> wrote:
> >
> > Hi Han,
> >
> > yes, I believe you are totally right. But it still feels like a chicken
> > and egg problem to me, storing the database timeout setting inside the
> > database itself. If there were at least a local command-line argument
> > to override the timeout value, that would already be amazing, because
> > currently there is no way to control it before the database connection
> > is made, and if it cannot be made, it is too late to try to control it.
> >
>
> What about the case where the NB database is huge and it takes > 5 seconds
> to fetch all the contents?
>
> Numan
>
> > Thanks,
> > Renat.
> >
> > Han Zhou <[email protected]> wrote on Fri, 17 Sep 2021, 23:55:
> >
> > > On Fri, Sep 17, 2021 at 1:48 PM Renat Nurgaliyev <[email protected]>
> > > wrote:
> > > >
> > > > Hello Han,
> > > >
> > > > when I wrote this patch we had an issue with a very big SB
> > > > database, around 1.5 gigabytes. There were no controllers or
> > > > northds running, so the database server was without any load at
> > > > all. Although OVSDB was idling, even a single northd process could
> > > > not fully connect to the database due to its size, since it could
> > > > not fetch and process the data in 5 seconds.
> > >
> > > Hi Renat, thanks for the explanation. However, suppose SB is still
> > > huge: if NB is not that big, the probe config in NB_Global will soon
> > > be applied to ovn-northd, which would then probe at the proper
> > > interval (the desired setting, with the SB size considered) instead
> > > of the default 5 sec, and it should succeed, right?
> > >
> > > Thanks,
> > > Han
> > >
> > > >
> > > > Since then many optimizations were made, and the database size
> > > > with the same topology was reduced to approximately twenty
> > > > megabytes, so today I wouldn't be able to reproduce the problem.
> > > >
> > > > However, I am quite sure that it would still cause trouble at huge
> > > > scale, when SB grows to hundreds of megabytes. With the default
> > > > timeout of 5 seconds, which is implemented in the same thread that
> > > > also fetches and processes data, we create an artificial database
> > > > size limit, which is not so obvious to troubleshoot.
> > > >
> > > > Regards,
> > > > Renat.
> > > >
> > > > Han Zhou <[email protected]> wrote on Fri, 17 Sep 2021, 23:34:
> > > >>
> > > >> On Thu, Sep 16, 2021 at 8:05 PM Zhen Wang <[email protected]> wrote:
> > > >> >
> > > >> > From: zhen wang <[email protected]>
> > > >> >
> > > >> > This reverts commit 1e59feea933610b28fd4442243162ce35595cfee.
> > > >> > Above commit introduced a bug when multiple ovn-northd instances
> > > >> > work in HA mode. If SB leader and active ovn-northd instance got
> > > >> > killed by system power outage, standby ovn-northd instance would
> > > >> > never detect the failure.
> > > >> >
> > > >>
> > > >> Thanks Zhen! I added Renat and Numan, who worked on the reverted
> > > >> commit, to CC so that they can comment on whether this is OK.
> > > >>
> > > >> For the commit message, I think it may be decoupled from the HA
> > > >> scenario that is supposed to be fixed by the other patch in this
> > > >> series. The issue this patch fixes is that before the initial NB
> > > >> download is complete the northd will not send probes, so if the DB
> > > >> server goes down (ungracefully) before the northd reads the
> > > >> NB_Global options, the northd would never probe, and thus never
> > > >> reconnect to the new leader. (It is related to RAFT, but whether
> > > >> there are multiple northds is irrelevant.)
> > > >>
> > > >> As to the original commit that is reverted by this one:
> > > >>
> > > >>     northd: Don't poll ovsdb before the connection is fully established
> > > >>
> > > >>     Set initial SB and NB DBs probe interval to 0 to avoid connection
> > > >>     flapping.
> > > >>
> > > >>     Before configured in northd_probe_interval value is actually applied
> > > >>     to southbound and northbound database connections, both connections
> > > >>     must be fully established, otherwise ovnnb_db_run() will return
> > > >>     without retrieving configuration data from northbound DB. In cases
> > > >>     when southbound database is big enough, default interval of 5 seconds
> > > >>     will kill and retry the connection before it is fully established, no
> > > >>     matter what is set in northd_probe_interval. Client reconnect will
> > > >>     cause even more load to ovsdb-server and cause cascade effect, so
> > > >>     northd can never stabilise. We have more than 2000 ports in our lab,
> > > >>     and northd could not start before this patch, holding at 100% CPU
> > > >>     utilisation both itself and ovsdb-server.
> > > >>
> > > >>     After connections are established, any value in northd_probe_interval,
> > > >>     or default DEFAULT_PROBE_INTERVAL_MSEC is applied correctly.
> > > >>
> > > >> I am not sure how the commit would help. There are at most 3 - 5
> > > >> northds (in practice), and suppose there are tens or hundreds of
> > > >> ovn-controllers that make SB busy: it is just 3 - 5 more clients
> > > >> retrying to reconnect to SB several times, and if NB is not that
> > > >> busy (most likely), these northd clients should get the proper
> > > >> probe settings applied soon without causing more issues at all. So
> > > >> I don't think the default probe of 5 sec would cause a cascade
> > > >> effect for the initial period. @Renat @Numan please correct me if I
> > > >> am wrong.
> > > >>
> > > >> Thanks,
> > > >> Han
> > > >>
> > > >> > Signed-off-by: zhen wang <[email protected]>
> > > >> > ---
> > > >> >  northd/northd.c | 4 ++--
> > > >> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > >> >
> > > >> > diff --git a/northd/northd.c b/northd/northd.c
> > > >> > index 688a6e4ef..b7e64470f 100644
> > > >> > --- a/northd/northd.c
> > > >> > +++ b/northd/northd.c
> > > >> > @@ -74,8 +74,8 @@ static bool use_ct_inv_match = true;
> > > >> >
> > > >> >  /* Default probe interval for NB and SB DB connections. */
> > > >> >  #define DEFAULT_PROBE_INTERVAL_MSEC 5000
> > > >> > -static int northd_probe_interval_nb = 0;
> > > >> > -static int northd_probe_interval_sb = 0;
> > > >> > +static int northd_probe_interval_nb = DEFAULT_PROBE_INTERVAL_MSEC;
> > > >> > +static int northd_probe_interval_sb = DEFAULT_PROBE_INTERVAL_MSEC;
> > > >> >  #define MAX_OVN_TAGS 4096
> > > >> >
> > > >> >  /* Pipeline stages. */
> > > >> > --
> > > >> > 2.20.1
> > > >> >
> > _______________________________________________
> > dev mailing list
> > [email protected]
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
