On Wed, Mar 20, 2019 at 11:07 AM Ben Pfaff <[email protected]> wrote:
>
> On Wed, Mar 20, 2019 at 10:44:55AM -0700, Han Zhou wrote:
> > On Wed, Mar 20, 2019 at 9:56 AM Ben Pfaff <[email protected]> wrote:
> > >
> > > On Wed, Mar 20, 2019 at 08:28:50AM -0700, Han Zhou wrote:
> > > > > On Wed, Mar 20, 2019 at 5:02 AM Ilya Maximets <[email protected]> wrote:
> > > > >
> > > > > On 20.03.2019 8:56, Han Zhou wrote:
> > > > > > From: Han Zhou <[email protected]>
> > > > > >
> > > > > > > When an update is requested through a follower, the leader sends
> > > > > > > AppendRequest to all followers and waits until AppendReply is
> > > > > > > received from a majority; it then updates the commit index, and
> > > > > > > the new entry is regarded as committed in the raft log. However,
> > > > > > > followers (including the one that initiated the request) are not
> > > > > > > notified of this commit until the next heartbeat (ping timeout)
> > > > > > > if there are no other pending requests. This results in long
> > > > > > > latency for updates made through followers, especially when a
> > > > > > > batch of updates is requested through the same follower.
> > > > > >
> > > > > > $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
> > > > > >
> > > > > > real    0m34.154s
> > > > > > user    0m0.083s
> > > > > > > sys     0m0.250s
> > > > > >
> > > > > > > This patch solves the problem by sending a heartbeat as soon as the
> > > > > > > commit index is updated on the leader. It also avoids unnecessary
> > > > > > > heartbeats by resetting the ping timer whenever an AppendRequest is
> > > > > > > broadcast. With this patch, the same test runs more than 50 times
> > > > > > > faster:
> > > > > >
> > > > > > $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
> > > > > >
> > > > > > real    0m0.564s
> > > > > > user    0m0.080s
> > > > > > > sys     0m0.199s
> > > > > >
> > > > > > Torture test cases are also updated because otherwise the tests will
> > > > > > all be skipped because of the improved performance.
> > > > > >
> > > > > > Signed-off-by: Han Zhou <[email protected]>
> > > > > > ---
> > > > > >
> > > > > > Notes:
> > > > > > >     v3->v4: Update torture tests again. Instead of sleeping, the
> > > > > > >     size of each client's transaction is increased to slow down
> > > > > > >     execution, so that the chance of parallel execution is not
> > > > > > >     reduced.
> > > > > >
> > > > >
> > > > > Unfortunately, this patch fails all the testsuite runs on TravisCI:
> > > > >
> > > > >   https://travis-ci.org/ovsrobot/ovs/builds/508777615
> > > > >
> > > > > And some on CirrusCI too:
> > > > >
> > > > >   https://cirrus-ci.com/build/5201766546145280
> > > > >
> > > > > Best regards, Ilya Maximets.
> > > > >
> > > >
> > > > Does the CI retry failed tests? The failed ones are some of the
> > > > torture tests in ovsdb-cluster.at, which was discussed here:
> > > > https://mail.openvswitch.org/pipermail/ovs-dev/2019-March/357373.html
> > > >
> > > > Basically, the failures are real bugs that are not caused by this
> > > > patch code itself, but triggered by the test case change in this
> > > > patch.
> > > >
> > > > The test cases are improved in this patch so that they can now find
> > > > a bug that was not found before. To avoid CI failures, we can either
> > > > merge V3 (whose tests were less effective) or wait until the bug is
> > > > fixed.
> > >
> > > Both of these do retry failed tests.  You can see the details from the
> > > logs at the URLs that Ilya cited.
> >
> > Yes, checking the log again, I see that the failed torture tests were
> > retried once, and some of them failed again on retry, which makes me
> > more confident in the effectiveness of the updated test cases. I may
> > be distracted today, but I will continue debugging tomorrow. I am
> > pretty confident that the bug is not related to the current patch,
> > because it is easy to reproduce the failures, such as tests 2528 and
> > 2533, on current master with only the torture test case change applied.
>
> By the way, I totally support this effort and I'm really looking forward
> to applying the fixes when we figure out how to make the tests both
> effective and pass in the normal case.

Hi Ben, I fixed a reconnection bug that was causing the client IDL to
reconnect to the same old server after that server was disconnected from
the cluster in the torture test:
https://mail.openvswitch.org/pipermail/ovs-dev/2019-March/357443.html

When the server was disconnected, it sent a monitor_cancelled message to
the client, so the reconnect FSM transitioned back to ACTIVE with the
same old server because of the activity on the session. After the fix,
the torture tests all passed in several runs on my laptop with the
current patch V4. Please try it and let me know.
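
To make the failure mode above concrete, here is a toy Python model of
it. This is only a sketch, not the OVS reconnect library or the IDL:
the class, its method names, the exact ordering of events, and the
"guard" flag are all assumptions made for illustration, and the guard
is not necessarily what the linked fix does. It only shows how session
activity arriving after a reconnect decision can cancel that decision
and leave the client on the same server.

```python
from enum import Enum, auto


class State(Enum):
    ACTIVE = auto()
    RECONNECT = auto()  # client has decided to drop the current server


class ToySession:
    """Toy client session (hypothetical; not the OVS reconnect FSM)."""

    def __init__(self, servers, guard_reconnect):
        self.servers = list(servers)
        self.index = 0
        self.state = State.ACTIVE
        # Hypothetical switch: if True, activity cannot cancel a
        # pending reconnect.
        self.guard_reconnect = guard_reconnect

    @property
    def server(self):
        return self.servers[self.index]

    def force_reconnect(self):
        # The client wants to move away from the current server.
        self.state = State.RECONNECT

    def activity(self):
        # Any received message counts as activity on the session.
        if self.guard_reconnect and self.state is State.RECONNECT:
            return
        self.state = State.ACTIVE

    def run(self):
        # Act on a pending reconnect by moving to the next server.
        if self.state is State.RECONNECT:
            self.index = (self.index + 1) % len(self.servers)
            self.state = State.ACTIVE


def simulate(guard_reconnect):
    session = ToySession(["s1", "s2", "s3"], guard_reconnect)
    session.force_reconnect()  # s1 has left the cluster
    session.activity()         # monitor_cancelled arrives from s1
    session.run()
    return session.server
```

In this model, simulate(False) leaves the client on "s1" because the
incoming message flips the state back to ACTIVE before the reconnect is
acted on, while simulate(True) moves the client on to "s2".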

Thanks,
Han