Hi Numan,

according to our experience, in such case the connection can never be fully
established, and northd ends up being in an endless loop with 100% CPU
utilization.

Numan Siddique <[email protected]> schrieb am Sa., 18. Sept. 2021, 00:05:

> On Fri, Sep 17, 2021 at 5:02 PM Renat Nurgaliyev <[email protected]>
> wrote:
> >
> > Hi Han,
> >
> > yes, I believe you are totally right. But it still feels like a chicken
> and
> > egg problem to me, storing the database timeout setting inside the
> database
> > itself. If there would be at least some local command line argument to
> > override timeout value, it would be already amazing, because currently
> > there is no way to control it before the database connection is made, and
> > if it cannot be made, it is too late to try to control it.
> >
>
> What about the case where the NB database is huge and it takes > 5
> seconds to fetch
> all the contents ?
>
> Numan
>
> > Thanks,
> > Renat.
> >
> > Han Zhou <[email protected]> schrieb am Fr., 17. Sept. 2021, 23:55:
> >
> > >
> > >
> > > On Fri, Sep 17, 2021 at 1:48 PM Renat Nurgaliyev <[email protected]>
> > > wrote:
> > > >
> > > > Hello Han,
> > > >
> > > > when I wrote this patch we had an issue with a very big SB database,
> > > around 1,5 gigabytes. There were no controllers or northds running, so
> the
> > > database server was without any load at all. Although OVSDB was idling,
> > > even a single northd process could not fully connect to the database
> due to
> > > its size, since it could not fetch and process the data in 5 seconds.
> > >
> > > Hi Renat, thanks for the explanation. However, suppose SB is still
> huge,
> > > if NB is not that big, the probe config in NB_Global will soon be
> applied
> > > to ovn-northd, which would probe in proper interval (desired setting
> with
> > > the SB size considered) instead of the default 5 sec, and it should
> > > succeed, right?
> > >
> > > Thanks,
> > > Han
> > >
> > > >
> > > > Since then many optimizations were made, and the database size with
> the
> > > same topology reduced to approximately twenty megabytes, so today I
> > > wouldn't be able to reproduce the problem.
> > > >
> > > > However, I am quite sure that it would still cause troubles with a
> huge
> > > scale, when SB grows to hundreds of megabytes. With the default
> timeout of
> > > 5 seconds, which is implemented in the same thread that also fetches
> and
> > > processes data, we make an artificial database size limit, which is
> not so
> > > obvoius to troubleshoot.
> > > >
> > > > Regards,
> > > > Renat.
> > > >
> > > > Han Zhou <[email protected]> schrieb am Fr., 17. Sept. 2021, 23:34:
> > > >>
> > > >>
> > > >>
> > > >> On Thu, Sep 16, 2021 at 8:05 PM Zhen Wang <[email protected]>
> wrote:
> > > >> >
> > > >> > From: zhen wang <[email protected]>
> > > >> >
> > > >> > This reverts commit 1e59feea933610b28fd4442243162ce35595cfee.
> > > >> > Above commit introduced a bug when muptiple ovn-northd instances
> work
> > > in HA
> > > >> > mode. If SB leader and active ovn-northd instance got killed by
> > > system power
> > > >> > outage, standby ovn-northd instance would never detect the
> failure.
> > > >> >
> > > >>
> > > >> Thanks Zhen! I added the Renat and Numan who worked on the reverted
> > > commit to CC, so that they can comment if this is ok.
> > > >>
> > > >> For the commit message, I think it may be decoupled from the HA
> > > scenario that is supposed to be fixed by the other patch in this
> series.
> > > The issue this patch fixes is that before the initial NB downloading is
> > > complete the northd will not send probe, so if the DB server is down
> > > (ungracefully) before the northd reads the NB_Global options, the
> northd
> > > would never probe, thus never reconnect to the new leader. (it is
> related
> > > to RAFT, but whether it is multiple northds is irrelevant)
> > > >>
> > > >> As to the original commit that is reverted by this one:
> > > >>
> > > >>     northd: Don't poll ovsdb before the connection is fully
> established
> > > >>
> > > >>     Set initial SB and NB DBs probe interval to 0 to avoid
> connection
> > > >>     flapping.
> > > >>
> > > >>     Before configured in northd_probe_interval value is actually
> applied
> > > >>     to southbound and northbound database connections, both
> connections
> > > >>     must be fully established, otherwise ovnnb_db_run() will return
> > > >>     without retrieving configuration data from northbound DB. In
> cases
> > > >>     when southbound database is big enough, default interval of 5
> > > seconds
> > > >>     will kill and retry the connection before it is fully
> established,
> > > no
> > > >>     matter what is set in northd_probe_interval. Client reconnect
> will
> > > >>     cause even more load to ovsdb-server and cause cascade effect,
> so
> > > >>     northd can never stabilise. We have more than 2000 ports in our
> lab,
> > > >>     and northd could not start before this patch, holding at 100%
> CPU
> > > >>     utilisation both itself and ovsdb-server.
> > > >>
> > > >>     After connections are established, any value in
> > > northd_probe_interval,
> > > >>     or default DEFAULT_PROBE_INTERVAL_MSEC is applied correctly.
> > > >>
> > > >> I am not sure how would the commit help. There are at most 3 - 5
> > > northds (in practice), and suppose there are tens or hundreds of
> > > ovn-controllers that makes SB busy, it is just 3 - 5 more clients
> retrying
> > > reconnect SB for several times, and if NB is not that busy (most
> likely),
> > > these northd clients should get the proper probe settings applied soon
> > > without causing more issues at all. So I don't think the default probe
> 5
> > > sec would cause cascade effect for the initial period. @Renat @Numan
> please
> > > correct me if I am wrong.
> > > >>
> > > >> Thanks,
> > > >> Han
> > > >>
> > > >> > Signed-off-by: zhen wang <[email protected]>
> > > >> > ---
> > > >> >  northd/northd.c | 4 ++--
> > > >> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > >> >
> > > >> > diff --git a/northd/northd.c b/northd/northd.c
> > > >> > index 688a6e4ef..b7e64470f 100644
> > > >> > --- a/northd/northd.c
> > > >> > +++ b/northd/northd.c
> > > >> > @@ -74,8 +74,8 @@ static bool use_ct_inv_match = true;
> > > >> >
> > > >> >  /* Default probe interval for NB and SB DB connections. */
> > > >> >  #define DEFAULT_PROBE_INTERVAL_MSEC 5000
> > > >> > -static int northd_probe_interval_nb = 0;
> > > >> > -static int northd_probe_interval_sb = 0;
> > > >> > +static int northd_probe_interval_nb =
> DEFAULT_PROBE_INTERVAL_MSEC;
> > > >> > +static int northd_probe_interval_sb =
> DEFAULT_PROBE_INTERVAL_MSEC;
> > > >> >  #define MAX_OVN_TAGS 4096
> > > >> >
> > > >> >  /* Pipeline stages. */
> > > >> > --
> > > >> > 2.20.1
> > > >> >
> > >
> > _______________________________________________
> > dev mailing list
> > [email protected]
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >
>
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to