On Wed, Mar 20, 2019 at 10:44:55AM -0700, Han Zhou wrote:
> On Wed, Mar 20, 2019 at 9:56 AM Ben Pfaff <[email protected]> wrote:
> >
> > On Wed, Mar 20, 2019 at 08:28:50AM -0700, Han Zhou wrote:
> > > On Wed, Mar 20, 2019 at 5:02 AM Ilya Maximets <[email protected]>
> > > wrote:
> > > >
> > > > On 20.03.2019 8:56, Han Zhou wrote:
> > > > > From: Han Zhou <[email protected]>
> > > > >
> > > > > When update is requested from follower, the leader sends AppendRequest
> > > > > to all followers and wait until AppendReply received from majority, and
> > > > > then it will update commit index - the new entry is regarded as committed
> > > > > in raft log. However, this commit will not be notified to followers
> > > > > (including the one initiated the request) until next heartbeat (ping
> > > > > timeout), if no other pending requests. This results in long latency
> > > > > for updates made through followers, especially when a batch of updates
> > > > > are requested through the same follower.
> > > > >
> > > > > $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
> > > > >
> > > > > real 0m34.154s
> > > > > user 0m0.083s
> > > > > sys 0m0.250s
> > > > >
> > > > > This patch solves the problem by sending heartbeat as soon as the commit
> > > > > index is updated in leader. It also avoids unnessary heartbeat by resetting
> > > > > the ping timer whenever AppendRequest is broadcasted. With this patch
> > > > > the performance is improved more than 50 times in same test:
> > > > >
> > > > > $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
> > > > >
> > > > > real 0m0.564s
> > > > > user 0m0.080s
> > > > > sys 0m0.199s
> > > > >
> > > > > Torture test cases are also updated because otherwise the tests will
> > > > > all be skipped because of the improved performance.
> > > > >
> > > > > Signed-off-by: Han Zhou <[email protected]>
> > > > > ---
> > > > >
> > > > > Notes:
> > > > > v3->v4: Update torture tests again. Instead of sleeping, the size of
> > > > > transaction of each client is increased to slow down the
> > > > > execution so that the chance of parallel executions are not reduced.
> > > > >
> > > >
> > > > Unfortunately, this patch fails all the testsuite runs on TravisCI:
> > > >
> > > > https://travis-ci.org/ovsrobot/ovs/builds/508777615
> > > >
> > > > And some on CirrusCI too:
> > > >
> > > > https://cirrus-ci.com/build/5201766546145280
> > > >
> > > > Best regards, Ilya Maximets.
> > > >
> > >
> > > Does the CI retry failed tests? The failed ones are some of the
> > > torture tests in ovsdb-cluster.at, which was discussed here:
> > > https://mail.openvswitch.org/pipermail/ovs-dev/2019-March/357373.html
> > >
> > > Basically, the failures are real bugs that are not caused by this
> > > patch code itself, but triggered by the test case change in this
> > > patch.
> > >
> > > The test cases are improved in this patch so that can now find the bug
> > > that was not found before. To avoid CI failure, we can either merge V3
> > > (the tests were less effective), or wait until the bug is fixed.
> >
> > Both of these do retry failed tests. You can see the details from the
> > logs at the URLs that Ilya cited.
>
> Yes, checking the log again, I saw the failed torture tests are
> retried once, and some of them failed again when retrying, which make
> me more confident for the effectiveness of the updated test cases. I
> may be distracted today but I will continue debugging tomorrow. I am
> pretty confident that the bug is not related to the current patch,
> because it is easy to reproduce the failures such as test 2528 and
> 2533 with current master applying only the torture test case change.
By the way, I totally support this effort and I'm really looking forward to applying the fixes when we figure out how to make the tests both effective and pass in the normal case.
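
For readers following the thread, the latency effect described in the quoted commit message can be illustrated with a toy model. This is only a sketch, not the OVS raft code: the function name `simulate_updates`, its parameters, the 1 ms one-way delay, and the 340 ms ping interval are all assumptions chosen for illustration. It models sequential updates issued through one follower, where the follower learns of a commit either at the next periodic heartbeat or right after the leader advances its commit index. The ping-timer reset that the patch also adds is not modeled.

```python
def simulate_updates(n_updates, one_way_ms, ping_interval_ms, notify_on_commit):
    """Return the simulated time (ms) to finish n sequential updates
    issued through a single follower.

    Toy model with hypothetical parameters, not the OVS implementation:
    - follower -> leader request:           one_way_ms
    - leader -> followers AppendRequest:    one_way_ms
    - followers -> leader AppendReply:      one_way_ms
    After that the leader advances its commit index.
    """
    now = 0
    for _ in range(n_updates):
        commit_time = now + 3 * one_way_ms
        if notify_on_commit:
            # Patched behavior: heartbeat is sent as soon as the
            # commit index is updated.
            send_time = commit_time
        else:
            # Unpatched behavior: the follower hears about the commit
            # only at the next periodic heartbeat at or after commit_time
            # (ceiling division to the next multiple of the interval).
            send_time = -(-commit_time // ping_interval_ms) * ping_interval_ms
        # The heartbeat carrying the new commit index reaches the follower,
        # which can then issue its next update.
        now = send_time + one_way_ms
    return now


if __name__ == "__main__":
    # Assumed values: 1 ms one-way delay, 340 ms ping interval.
    print(simulate_updates(100, 1, 340, notify_on_commit=False))
    print(simulate_updates(100, 1, 340, notify_on_commit=True))
```

With these assumed values the model gives 34001 ms without immediate notification and 400 ms with it. The absolute numbers depend entirely on the assumed delays, but the shape matches the report above: without the patch, each update through a follower costs roughly one full ping interval.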
