On Tue, Aug 4, 2020 at 11:31 PM Han Zhou <[email protected]> wrote:
> On Tue, Aug 4, 2020 at 12:38 AM Numan Siddique <[email protected]> wrote:
> >
> > On Tue, Aug 4, 2020 at 9:02 AM Tony Liu <[email protected]> wrote:
> >
> >> The probe wakes up recomputation?
> >> There is a probe every 5 seconds. Without any connection up/down or
> >> failover, ovn-northd will recompute everything every 5 seconds, no
> >> matter what? Really?
> >>
> >> Anyways, I will increase the probe interval for now and see if that
> >> helps.
> >
> > I think we should optimise this case. I am planning to look into it.
> >
> > Thanks
> > Numan
>
> Thanks, Numan. I'd like to discuss this more before we move forward
> with any change.
>
> 1) Regarding the problem itself, the CPU cost triggered by the OVSDB
> idle probe when there is no configuration change to compute: I don't
> think it matters that much in real production. It simply wastes CPU
> cycles when there is nothing to do, so what harm would it do? Since
> ovn-northd is the centralized component, we would always ensure there
> is enough CPU available for it when computing is needed, and that
> reservation is wasted anyway when there is no change to compute. So
> I'd avoid making any change specifically to address this issue. I
> could be wrong, though; I'd like to hear what the real concern would
> be if this is not addressed.
Agree with you.

> 2) ovn-northd incremental processing would avoid this CPU problem
> naturally. So let's discuss how to move forward with incremental
> processing, which is much more important because it also solves CPU
> efficiency when handling changes; the idle-probe problem is just a
> byproduct. I believe the DDlog branch would have solved this problem.
> However, it seems we are not sure about the current status of DDlog.
> As you proposed at the last OVN meeting, an alternative is to
> implement partial incremental processing using the I-P engine, like
> ovn-controller does. While I have no objection to this, we'd better
> check with Ben and Leonid on the plan to avoid overlapping and wasted
> work. @Ben @Leonid, would you mind sharing the status here, since you
> were not at the meeting last week?

My idea is not to address this issue specifically, but to start by
adding very minimalistic I-P support that avoids unnecessary
recomputation. I don't want to spend too much time or make it
complicated, but rather start with a simple version that triggers a
full recompute for any NB DB or SB DB change.

Thanks
Numan

> Thanks,
> Han
>
> >> Thanks!
> >>
> >> Tony
> >>
> >> > -----Original Message-----
> >> > From: Han Zhou <[email protected]>
> >> > Sent: Monday, August 3, 2020 8:22 PM
> >> > To: Tony Liu <[email protected]>
> >> > Cc: Han Zhou <[email protected]>; ovs-discuss <[email protected]>;
> >> > ovs-dev <[email protected]>
> >> > Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
> >> > configuration update
> >> >
> >> > Sorry that I didn't make it clear enough. The OVSDB probe itself
> >> > doesn't take much CPU, but the probe wakes up the ovn-northd main
> >> > loop, which recomputes everything; that is why you see the CPU
> >> > spike. It will be solved by incremental processing, when only the
> >> > delta is processed; in the case of probe handling there is no
> >> > change in configuration, so the delta is zero.
> >> > For now, please follow the steps to adjust the probe interval if
> >> > the CPU usage of ovn-northd (when there is no configuration
> >> > change) is a concern for you. But please remember that this has
> >> > no impact on the real CPU usage for handling configuration
> >> > changes.
> >> >
> >> > Thanks,
> >> > Han
> >> >
> >> > On Mon, Aug 3, 2020 at 8:11 PM Tony Liu <[email protected]> wrote:
> >> >
> >> > Health check (5 sec interval) taking 30%-100% CPU is definitely
> >> > not acceptable, if that's really the case. There must be some
> >> > blocking (not yielding the CPU) in the code, which is not
> >> > supposed to be there.
> >> >
> >> > Could you point me to the code for this health check?
> >> > Is it single-threaded? Does it use any event library?
> >> >
> >> > Thanks!
> >> >
> >> > Tony
> >> >
> >> > > -----Original Message-----
> >> > > From: Han Zhou <[email protected]>
> >> > > Sent: Saturday, August 1, 2020 9:11 PM
> >> > > To: Tony Liu <[email protected]>
> >> > > Cc: ovs-discuss <[email protected]>; ovs-dev
> >> > > <[email protected]>
> >> > > Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when
> >> > > no configuration update
> >> > >
> >> > > On Fri, Jul 31, 2020 at 4:14 PM Tony Liu <[email protected]> wrote:
> >> > >
> >> > > Hi,
> >> > >
> >> > > I see the active ovn-northd takes much CPU (30% - 100%) when
> >> > > there is no configuration from OpenStack, and nothing happening
> >> > > on any chassis node either.
> >> > >
> >> > > Is this expected? What is it busy with?
> >> > >
> >> > > Yes, this is expected.
> >> > > It is due to the OVSDB probe between ovn-northd and the NB/SB
> >> > > OVSDB servers, which is used to detect OVSDB connection
> >> > > failure. Usually this is not a concern (unlike the probe with a
> >> > > large number of ovn-controller clients), because ovn-northd is
> >> > > a centralized component and the CPU cost when there is no
> >> > > configuration change doesn't matter that much. However, if it
> >> > > is a concern, the probe interval (default 5 sec) can be
> >> > > changed. If you change it, remember to change it on both the
> >> > > server side and the client side.
> >> > > For the client side (ovn-northd), it is configured in the NB
> >> > > DB's NB_Global table's options:northd_probe_interval. See the
> >> > > man page of ovn-nb(5).
> >> > > For the server side (NB and SB), it is configured in the NB and
> >> > > SB DB's Connection table's inactivity_probe column.
> >> > >
> >> > > Thanks,
> >> > > Han
> >> > >
> >> > > ========
> >> > > 2020-07-31T23:08:09.511Z|04267|poll_loop|DBG|wakeup due to [POLLIN] on fd 8 (10.6.20.84:44358<->10.6.20.84:6641) at lib/stream-fd.c:157 (68% CPU usage)
> >> > > 2020-07-31T23:08:09.512Z|04268|jsonrpc|DBG|tcp:10.6.20.84:6641: received request, method="echo", params=[], id="echo"
> >> > > 2020-07-31T23:08:09.512Z|04269|jsonrpc|DBG|tcp:10.6.20.84:6641: send reply, result=[], id="echo"
> >> > > 2020-07-31T23:08:12.777Z|04270|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 (10.6.20.84:49158<->10.6.20.85:6642) at lib/stream-fd.c:157 (34% CPU usage)
> >> > > 2020-07-31T23:08:12.777Z|04271|reconnect|DBG|tcp:10.6.20.85:6642: idle 5002 ms, sending inactivity probe
> >> > > 2020-07-31T23:08:12.777Z|04272|reconnect|DBG|tcp:10.6.20.85:6642: entering IDLE
> >> > > 2020-07-31T23:08:12.777Z|04273|jsonrpc|DBG|tcp:10.6.20.85:6642: send request, method="echo", params=[], id="echo"
> >> > > 2020-07-31T23:08:12.777Z|04274|jsonrpc|DBG|tcp:10.6.20.85:6642: received request, method="echo", params=[], id="echo"
> >> > > 2020-07-31T23:08:12.777Z|04275|reconnect|DBG|tcp:10.6.20.85:6642: entering ACTIVE
> >> > > 2020-07-31T23:08:12.777Z|04276|jsonrpc|DBG|tcp:10.6.20.85:6642: send reply, result=[], id="echo"
> >> > > 2020-07-31T23:08:13.635Z|04277|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 (10.6.20.84:49158<->10.6.20.85:6642) at lib/stream-fd.c:157 (34% CPU usage)
> >> > > 2020-07-31T23:08:13.635Z|04278|jsonrpc|DBG|tcp:10.6.20.85:6642: received reply, result=[], id="echo"
> >> > > 2020-07-31T23:08:14.480Z|04279|hmap|DBG|Dropped 129 log messages in last 5 seconds (most recently, 0 seconds ago) due to excessive rate
> >> > > 2020-07-31T23:08:14.480Z|04280|hmap|DBG|lib/shash.c:112: 2 buckets with 6+ nodes, including 2 buckets with 6 nodes (32 nodes total across 32 buckets)
> >> > > 2020-07-31T23:08:14.513Z|04281|poll_loop|DBG|wakeup due to 27-ms timeout at lib/reconnect.c:643 (34% CPU usage)
> >> > > 2020-07-31T23:08:14.513Z|04282|reconnect|DBG|tcp:10.6.20.84:6641: idle 5001 ms, sending inactivity probe
> >> > > 2020-07-31T23:08:14.513Z|04283|reconnect|DBG|tcp:10.6.20.84:6641: entering IDLE
> >> > > 2020-07-31T23:08:14.513Z|04284|jsonrpc|DBG|tcp:10.6.20.84:6641: send request, method="echo", params=[], id="echo"
> >> > > 2020-07-31T23:08:15.370Z|04285|poll_loop|DBG|wakeup due to [POLLIN] on fd 8 (10.6.20.84:44358<->10.6.20.84:6641) at lib/stream-fd.c:157 (34% CPU usage)
> >> > > 2020-07-31T23:08:15.370Z|04286|jsonrpc|DBG|tcp:10.6.20.84:6641: received request, method="echo", params=[], id="echo"
> >> > > 2020-07-31T23:08:15.370Z|04287|reconnect|DBG|tcp:10.6.20.84:6641: entering ACTIVE
> >> > > 2020-07-31T23:08:15.370Z|04288|jsonrpc|DBG|tcp:10.6.20.84:6641: send reply, result=[], id="echo"
> >> > > 2020-07-31T23:08:16.236Z|04289|poll_loop|DBG|wakeup due to 0-ms timeout at tcp:10.6.20.84:6641 (100% CPU usage)
> >> > > 2020-07-31T23:08:16.236Z|04290|jsonrpc|DBG|tcp:10.6.20.84:6641: received reply, result=[], id="echo"
> >> > > 2020-07-31T23:08:17.778Z|04291|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 (10.6.20.84:49158<->10.6.20.85:6642) at lib/stream-fd.c:157 (100% CPU usage)
> >> > > 2020-07-31T23:08:17.778Z|04292|jsonrpc|DBG|tcp:10.6.20.85:6642: received request, method="echo", params=[], id="echo"
> >> > > 2020-07-31T23:08:17.778Z|04293|jsonrpc|DBG|tcp:10.6.20.85:6642: send reply, result=[], id="echo"
> >> > > 2020-07-31T23:08:20.372Z|04294|poll_loop|DBG|wakeup due to [POLLIN] on fd 8 (10.6.20.84:44358<->10.6.20.84:6641) at lib/stream-fd.c:157 (41% CPU usage)
> >> > > 2020-07-31T23:08:20.372Z|04295|reconnect|DBG|tcp:10.6.20.84:6641: idle 5002 ms, sending inactivity probe
> >> > > 2020-07-31T23:08:20.372Z|04296|reconnect|DBG|tcp:10.6.20.84:6641: entering IDLE
> >> > > 2020-07-31T23:08:20.372Z|04297|jsonrpc|DBG|tcp:10.6.20.84:6641: send request, method="echo", params=[], id="echo"
> >> > > 2020-07-31T23:08:20.372Z|04298|jsonrpc|DBG|tcp:10.6.20.84:6641: received request, method="echo", params=[], id="echo"
> >> > > 2020-07-31T23:08:20.372Z|04299|reconnect|DBG|tcp:10.6.20.84:6641: entering ACTIVE
> >> > > 2020-07-31T23:08:20.372Z|04300|jsonrpc|DBG|tcp:10.6.20.84:6641: send reply, result=[], id="echo"
> >> > > 2020-07-31T23:08:20.376Z|04301|hmap|DBG|Dropped 181 log messages in last 6 seconds (most recently, 1 seconds ago) due to excessive rate
> >> > > 2020-07-31T23:08:20.376Z|04302|hmap|DBG|northd/ovn-northd.c:595: 2 buckets with 6+ nodes, including 2 buckets with 6 nodes (256 nodes total across 256 buckets)
> >> > > 2020-07-31T23:08:21.222Z|04303|poll_loop|DBG|wakeup due to [POLLIN] on fd 8 (10.6.20.84:44358<->10.6.20.84:6641) at lib/stream-fd.c:157 (41% CPU usage)
> >> > > 2020-07-31T23:08:21.223Z|04304|jsonrpc|DBG|tcp:10.6.20.84:6641: received reply, result=[], id="echo"
> >> > > 2020-07-31T23:08:22.779Z|04305|poll_loop|DBG|wakeup due to 706-ms timeout at lib/reconnect.c:643 (41% CPU usage)
> >> > > 2020-07-31T23:08:22.779Z|04306|reconnect|DBG|tcp:10.6.20.85:6642: idle 5001 ms, sending inactivity probe
> >> > > 2020-07-31T23:08:22.779Z|04307|reconnect|DBG|tcp:10.6.20.85:6642: entering IDLE
> >> > > 2020-07-31T23:08:22.779Z|04308|jsonrpc|DBG|tcp:10.6.20.85:6642: send request, method="echo", params=[], id="echo"
> >> > > 2020-07-31T23:08:23.632Z|04309|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 (10.6.20.84:49158<->10.6.20.85:6642) at lib/stream-fd.c:157 (79% CPU usage)
> >> > > 2020-07-31T23:08:23.632Z|04310|jsonrpc|DBG|tcp:10.6.20.85:6642: received request, method="echo", params=[], id="echo"
> >> > > 2020-07-31T23:08:23.632Z|04311|reconnect|DBG|tcp:10.6.20.85:6642: entering ACTIVE
> >> > > 2020-07-31T23:08:23.632Z|04312|jsonrpc|DBG|tcp:10.6.20.85:6642: send reply, result=[], id="echo"
> >> > > 2020-07-31T23:08:24.481Z|04313|poll_loop|DBG|wakeup due to 0-ms timeout at tcp:10.6.20.85:6642 (79% CPU usage)
> >> > > 2020-07-31T23:08:24.481Z|04314|jsonrpc|DBG|tcp:10.6.20.85:6642: received reply, result=[], id="echo"
> >> > > 2020-07-31T23:08:25.372Z|04315|poll_loop|DBG|wakeup due to [POLLIN] on fd 8 (10.6.20.84:44358<->10.6.20.84:6641) at lib/stream-fd.c:157 (79% CPU usage)
> >> > > 2020-07-31T23:08:25.372Z|04316|jsonrpc|DBG|tcp:10.6.20.84:6641: received request, method="echo", params=[], id="echo"
> >> > > 2020-07-31T23:08:25.372Z|04317|jsonrpc|DBG|tcp:10.6.20.84:6641: send reply, result=[], id="echo"
> >> > > 2020-07-31T23:08:28.634Z|04318|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 (10.6.20.84:49158<->10.6.20.85:6642) at lib/stream-fd.c:157 (50% CPU usage)
> >> > > 2020-07-31T23:08:28.634Z|04319|reconnect|DBG|tcp:10.6.20.85:6642: idle 5002 ms, sending inactivity probe
> >> > > 2020-07-31T23:08:28.634Z|04320|reconnect|DBG|tcp:10.6.20.85:6642: entering IDLE
> >> > > 2020-07-31T23:08:28.634Z|04321|jsonrpc|DBG|tcp:10.6.20.85:6642: send request, method="echo", params=[], id="echo"
> >> > > 2020-07-31T23:08:28.634Z|04322|jsonrpc|DBG|tcp:10.6.20.85:6642: received request, method="echo", params=[], id="echo"
> >> > > 2020-07-31T23:08:28.634Z|04323|reconnect|DBG|tcp:10.6.20.85:6642: entering ACTIVE
> >> > > 2020-07-31T23:08:28.634Z|04324|jsonrpc|DBG|tcp:10.6.20.85:6642: send reply, result=[], id="echo"
> >> > > 2020-07-31T23:08:28.638Z|04325|hmap|DBG|Dropped 266 log messages in last 9 seconds (most recently, 3 seconds ago) due to excessive rate
> >> > > 2020-07-31T23:08:28.638Z|04326|hmap|DBG|northd/ovn-northd.c:595: 2 buckets with 6+ nodes, including 2 buckets with 6 nodes (256 nodes total across 256 buckets)
> >> > > ========
> >> > >
> >> > > Thanks!
> >> > >
> >> > > Tony
> >> > >
> >> > > _______________________________________________
> >> > > discuss mailing list
> >> > > [email protected]
> >> > > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
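[Editor's note: the probe-interval changes discussed above can be applied roughly as follows. This is a sketch, not part of the original thread: it assumes the NB/SB databases already have a row in their Connection table (e.g. created earlier with ovn-nbctl set-connection / ovn-sbctl set-connection), and 60000 ms is just an example value.]

```shell
# Client side: make ovn-northd probe the NB/SB servers every 60 s
# instead of the default 5 s (value in milliseconds; 0 disables the
# probe entirely). See ovn-nb(5), NB_Global options:northd_probe_interval.
ovn-nbctl set NB_Global . options:northd_probe_interval=60000

# Server side: raise the inactivity probe on the NB and SB databases'
# Connection records to match ("." resolves when there is one record).
ovn-nbctl set connection . inactivity_probe=60000
ovn-sbctl set connection . inactivity_probe=60000
```

[Both sides should be changed together: whichever end has the shorter interval will still wake the other up with "echo" requests every cycle.]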
