On Fri, Mar 15, 2019 at 03:37:20PM -0700, Han Zhou wrote:
> On Fri, Mar 15, 2019 at 3:30 PM Ben Pfaff <b...@ovn.org> wrote:
> >
> > On Thu, Mar 14, 2019 at 10:13:56AM -0700, Han Zhou wrote:
> > > From: Han Zhou <hzh...@ebay.com>
> > >
> > > When update is requested from follower, the leader sends AppendRequest
> > > to all followers and wait until AppendReply received from majority, and
> > > then it will update commit index - the new entry is regarded as committed
> > > in raft log. However, this commit will not be notified to followers
> > > (including the one initiated the request) until next heartbeat (ping
> > > timeout), if no other pending requests. This results in long latency
> > > for updates made through followers, especially when a batch of updates
> > > are requested through the same follower.
> > >
> > >     $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
> > >
> > >     real    0m34.154s
> > >     user    0m0.083s
> > >     sys     0m0.250s
> > >
> > > This patch solves the problem by sending heartbeat as soon as the commit
> > > index is updated in leader. It also avoids unnessary heartbeat by resetting
> > > the ping timer whenever AppendRequest is broadcasted. With this patch
> > > the performance is improved more than 50 times in same test:
> > >
> > >     $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
> > >
> > >     real    0m0.564s
> > >     user    0m0.080s
> > >     sys     0m0.199s
> > >
> > > The parameters in torture test is also adjusted because of the improved
> > > performance, otherwise the tests will all be skipped.
> >
> > The patch seems very reasonable and the concept makes sense, but when I
> > run
> >     make -j10 check TESTSUITEFLAGS='-k ovsdb,cluster,torture -j10'
> > it comes near to killing my laptop, with multiple ovsdb-servers going to
> > 100% CPU.  Without the patch, I don't see behavior like that at all.
> >
> > Do you see the same thing?
> >
> I think this is caused by the change in torture test case, as
> mentioned in the commit message: The parameters in torture test is
> also adjusted because of the improved performance, otherwise the
> tests will all be skipped. (because all clients finishes the tasks
> at phase 0)
>
> Unlike other tests, I never run torture tests using -j, since it is
> more related to timing and less stable. Now since I increased the
> clients to 20 x 40 in the test, it is likely to kill your laptop by
> using -j10. Do you have any suggestion for this? Maybe I can try keep
> the number of clients small but put some sleep between requests to
> slow them down.

I think that the patch is actually causing a lot more CPU use.  For
example, without the patch, running test 2525 (OVSDB 3-server torture
test - kill/restart leader) takes about 15 seconds and uses about 5.6
seconds of CPU time:

    $ time make -j10 check TESTSUITEFLAGS=2525
    [...]
    ## ------------------------------- ##
    ## openvswitch 2.11.90 test suite. ##
    ## ------------------------------- ##
    2525: OVSDB 3-server torture test - kill/restart leader ok
    [...]
    real    0m15.644s
    user    0m4.140s
    sys     0m1.459s

With the patch, it takes 1 3/4 minutes and uses over 178 seconds of CPU
time:

    real    1m44.701s
    user    2m15.691s
    sys     0m42.651s

That explains the problem I see with -j10 pretty well.

Do you see the same thing?
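
For anyone who wants to check the arithmetic, the CPU figures quoted above are simply user + sys from the two `time` outputs. A minimal sketch in Python that recomputes them; the helper names `to_seconds` and `cpu_seconds` are made up for illustration and are not part of the OVS tree or test suite:

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a bash `time` value such as '2m15.691s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time value: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def cpu_seconds(user: str, system: str) -> float:
    """Total CPU time is user time plus system time."""
    return to_seconds(user) + to_seconds(system)


# Values copied from the two runs of test 2525 quoted in this message.
before = cpu_seconds("0m4.140s", "0m1.459s")    # without the patch
after = cpu_seconds("2m15.691s", "0m42.651s")   # with the patch

print(f"without patch: {before:.3f} s CPU")
print(f"with patch:    {after:.3f} s CPU")
print(f"ratio:         {after / before:.1f}x")
```

By this measure the patched run burns roughly 30 times the CPU of the unpatched one, while wall-clock time grows by less than a factor of 7.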