On 2/18/20 7:12 AM, Anton Ivanov wrote:
>
> On 17/02/2020 14:48, Dumitru Ceara wrote:
>> On 2/14/20 6:54 PM, [email protected] wrote:
>>> From: Anton Ivanov <[email protected]>
>>>
>>> 1. Adds "persistent" behaviour where feasible (streams and signals).
>>> These are waited upon in the same thread where they are created. This
>>> allows them to be registered persistently with the OS (if possible)
>>> and lets the OS provide hints - is the FD ready, is it closed, etc.
>>>
>>> 2. Removes unnecessary attempts to read (and hit EAGAIN) on an fd
>>> which is not ready, if that fd has been registered as "private" to
>>> the thread which waits upon it.
>>>
>>> 3. No longer breaks other parts of OVS which create the fd in one
>>> thread and wait upon it in others.
>>>
>>> 4. Adds support for EPOLL on Linux; this can be expanded to cover
>>> similar poll++ frameworks in other OSes.
>>>
>>> 5. Sets up the necessary infrastructure to make IO/SSL multi-threaded
>>> using a "central (e)poll dispatcher + IO threads" pattern.
>>>
>>> Signed-off-by: Anton Ivanov <[email protected]>
>>
>> Hi Anton,
>>
>> A couple of issues inline. Apart from those:
>>
>> 1. The "STP - flush the fdb and mdb when topology changed" OVS test
>> fails with your patches applied:
>>
>> make check TESTSUITEFLAGS='-k "flush the fdb"'
>
> I will have a look.
>
>> 2. The Travis CI build fails:
>>
>> lib/fatal-signal.c:244:5: error: ignoring return value of ‘read’,
>> declared with attribute warn_unused_result [-Werror=unused-result]
>>
>> read(signal_fds[0], sigbuffer, sizeof(sigbuffer))
>>
>> 3. The Travis CI OSX build fails:
>>
>> lib/poll-loop.c:46:1: error: unused function 'poll_create_node_add'
>> [-Werror,-Wunused-function]
>>
>> COVERAGE_DEFINE(poll_create_node);
>
> I will fix all of these in the next version. The CI was out with the
> daisies when I submitted the patch last week, so I did not see the
> logs until yesterday.
>
>> 4. While OVS might benefit from these changes, I'm wondering about
>> OVN and ovsdb-server specifically. ovsdb-server is single-threaded,
>> and on large-scale deployments we don't usually see "poll" as the
>> bottleneck, nor the fact that the code tries to read/write FDs that
>> are not ready for read/write.
>>
>> For example, here are the results of running a scale-test scenario
>> which repeats the following iteration 300 times:
>> - Bring up a node (ovn-fake-multinode container) and connect it to
>>   the OVN Southbound DB.
>> - Configure an OVN logical switch to be bound to the new node.
>> - Configure an OVN logical switch port on the new logical switch.
>> - Configure an OVS internal interface on the new node and bind it to
>>   the OVN logical switch port.
>> - Wait until the new internal interface can ping its default gateway
>>   through OVN (i.e., until ovn-controller on the node has received
>>   all updates from the SB DB and installed all OVS flows); this
>>   duration is highlighted in the output.
>>
>> The tests use rally-ovs (ovn-scale-test) on a 9-server setup (1
>> machine running the OVN ovsdb-servers and ovn-northd, and 8 machines
>> simulating chassis using ovn-fake-multinode), in particular this
>> modified scenario:
>> https://github.com/dceara/ovn-scale-test/blob/ovn-switch-per-node/samples/tasks/scenarios/ovn-network/osh_workload_incremental.json
>>
>> With OVS master and OVN master:
>> http://pastebin.test.redhat.com/836568
>>
>> With OVS master + your patches and OVN master:
>> http://pastebin.test.redhat.com/836571
>>
>> Here are some of the logs we get on the OVN Southbound DB ovsdb-server
>> showing that ovsdb-server spends up to 2 seconds in a single loop
>> iteration sending/receiving updates to/from ovn-controllers:
>>
>> 2020-02-17T10:43:41.175Z|01991|poll_loop|INFO|wakeup due to
>> [OVS_POLLIN] on fd 140 (192.16.0.1:6642<->192.16.0.120:52018) at
>> lib/stream-fd.c:79 (84% CPU usage)
>> 2020-02-17T10:43:43.338Z|01992|timeval|WARN|Unreasonably long 2163ms
>> poll interval (2144ms user, 9ms system)
>> 2020-02-17T10:43:43.339Z|01993|timeval|WARN|faults: 590 minor, 0 major
>> 2020-02-17T10:43:43.339Z|01994|timeval|WARN|disk: 0 reads, 8 writes
>> 2020-02-17T10:43:43.339Z|01995|timeval|WARN|context switches: 0
>> voluntary, 4 involuntary
>> 2020-02-17T10:43:43.339Z|01996|poll_loop|INFO|Dropped 63 log messages
>> in last 2 seconds (most recently, 2 seconds ago) due to excessive rate
>> 2020-02-17T10:43:43.339Z|01997|poll_loop|INFO|wakeup due to
>> [OVS_POLLIN] on fd 76 (192.16.0.1:6642<->192.16.0.56:33538) at
>> lib/stream-fd.c:79 (84% CPU usage)
>> 2020-02-17T10:43:45.495Z|01998|timeval|WARN|Unreasonably long 2156ms
>> poll interval (2129ms user, 17ms system)
>> 2020-02-17T10:43:45.495Z|01999|timeval|WARN|faults: 738 minor, 0 major
>> 2020-02-17T10:43:45.495Z|02000|timeval|WARN|context switches: 0
>> voluntary, 7 involuntary
>> 2020-02-17T10:43:47.651Z|02001|timeval|WARN|Unreasonably long 2157ms
>> poll interval (2136ms user, 10ms system)
>>
>> In this case, and I think in most OVN use cases, ovsdb-server is busy
>> because it actually has to send updates to the large number of
>> ovn-controllers connected to it. Unless I'm missing something, the
>> epoll change seems to improve performance only in cases where the
>> Southbound DB doesn't do much sending/receiving.
>
> 1. Correct - it improves the handling and the cost of idle
> connections. At present each connection is a fixed cost on every
> iteration, regardless of whether it needs servicing or not.
>
> 2. It should also improve the cost of handling many peers on the send
> side, when there is enough outstanding data on the sockets to cause
> EAGAIN on send.
>
> 3. It does not fix the fundamental problem that the logic in ovsdb is
> single-threaded. It does, however, allow json+io+ssl to become
> multi-threaded, which should leave just the logic in the main
> ovsdb-server thread. You cannot do that effectively without
> information on the state of each socket and whether it needs
> servicing.
>
> 4. I have seen the rally tests - they are flat scale-up. I agree that
> the help from fixing the IO loop will be minimal there, because there
> is IO to be serviced on most fds at all times, so I would not expect
> these patches to help a lot in that scenario.
>
> The difference from fixing the IO (leaving aside that it is a
> prerequisite for moving SSL to worker threads) is not in scaling up,
> but in running at scale.
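
To make sure I understand the direction: below is a rough sketch of the
"central epoll dispatcher + IO worker threads" pattern as I read it.
The names are hypothetical, it is deliberately simplified (no
EPOLLONESHOT/re-arm handling, no bounded-queue checks), and it is not
the actual patch code. The point is that only the dispatcher waits on
the sockets, and workers only ever touch sockets the kernel reported as
ready.

    /* Sketch only -- hypothetical names, not OVS code. */
    #include <pthread.h>
    #include <sys/epoll.h>

    #define MAX_EVENTS 64
    #define QUEUE_LEN  1024        /* power of two; overflow unchecked */

    static int ready_fds[QUEUE_LEN];
    static unsigned int q_head, q_tail;
    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t q_cond = PTHREAD_COND_INITIALIZER;

    /* Hand a ready socket from the dispatcher to the workers. */
    static void enqueue(int fd)
    {
        pthread_mutex_lock(&q_lock);
        ready_fds[q_tail++ % QUEUE_LEN] = fd;
        pthread_cond_signal(&q_cond);
        pthread_mutex_unlock(&q_lock);
    }

    static int dequeue(void)
    {
        pthread_mutex_lock(&q_lock);
        while (q_head == q_tail) {
            pthread_cond_wait(&q_cond, &q_lock);
        }
        int fd = ready_fds[q_head++ % QUEUE_LEN];
        pthread_mutex_unlock(&q_lock);
        return fd;
    }

    /* Expensive per-socket work (SSL records, JSON parsing). */
    void do_ssl_and_json_io(int fd);

    /* Register a socket once; from then on the kernel tracks it. */
    void register_socket(int epfd, int fd)
    {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    }

    /* Dispatcher: the only thread that waits; it does no heavy IO. */
    void *dispatcher(void *arg)
    {
        int epfd = *(int *) arg;
        struct epoll_event events[MAX_EVENTS];

        for (;;) {
            int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
            for (int i = 0; i < n; i++) {
                enqueue(events[i].data.fd);
            }
        }
    }

    /* IO worker: services only sockets reported as ready. */
    void *io_worker(void *arg)
    {
        (void) arg;
        for (;;) {
            do_ssl_and_json_io(dequeue());
        }
    }

A real implementation would presumably also need EPOLLONESHOT (or some
other re-arm scheme) so the same socket is not handed to two workers at
once, plus a way to feed the parsed input back to the single-threaded
ovsdb logic.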
The current "ram the EGAIN wall until it gives up" > design gives you a mandatory penalty per transaction deploying config to > one node while running (because of attempted failed reads on EAGAIN). > The correct test for that is the CPU cost and latency to deploy a > logical flow when running at steady state in f.e. 300 nodes.
Currently, when a logical flow is added to the SB DB for a logical
datapath (logical switch or logical router), all ovn-controllers that
consider the logical datapath "local" will receive the update from the
ovsdb SB. The local datapaths of an ovn-controller are:
- all logical switch datapaths that have ports bound to the local
  chassis.
- all logical datapaths corresponding to logical routers connected
  through OVN patch ports to local logical switches.

While it's true that in some deployments this will limit the number of
ovn-controllers receiving the update, in OVN-k8s, for example, all node
logical switches are connected to a single cluster logical router. Once
a logical flow is added to any of the node logical switches, all
ovn-controllers on all nodes will receive the update. This means that
there won't be too many idle connections during a SB DB logical flow
update.

I agree that the right way to test is to measure how long it takes for
a logical flow to be installed, which is what the rally-ovs tests do:
they measure how long it takes until the SB DB logical flow is
translated to OpenFlow by ovn-controller when a new node is brought up.
At iteration X+1 we are at a steady state (X nodes up and configured),
node X+1 is brought up, and its corresponding logical flows are added.

> The current design as written has that penalty as a given - it is
> unavoidable, and it also grows linearly (or worse) with size from
> other factors in addition to EAGAIN - e.g. you also start copying a
> large pollfd array back and forth to the kernel on every iteration.
>
> I have not seen any tests trying to quantify this on an actual
> cluster. All tests I have seen so far are scale-up and/or run a single
> synthetic application on the whole cluster.
>
> My own tests mostly take the code in question and run it outside of
> OVS/OVN on a harness.
>
>> How do you test performance/scalability improvements?
>>
>> Regards,
>> Dumitru

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
