On Tue, Mar 19, 2019 at 11:42:30PM -0700, Han Zhou wrote:
> On Tue, Mar 19, 2019 at 6:42 PM Ben Pfaff <[email protected]> wrote:
> >
> > On Mon, Mar 18, 2019 at 05:02:15PM -0700, Han Zhou wrote:
> > > On Mon, Mar 18, 2019 at 2:49 PM Ben Pfaff <[email protected]> wrote:
> > > >
> > > > On Fri, Mar 15, 2019 at 04:17:35PM -0700, Han Zhou wrote:
> > > > > From: Han Zhou <[email protected]>
> > > > >
> > > > > > When an update is requested from a follower, the leader sends
> > > > > > AppendRequest to all followers and waits until AppendReply is
> > > > > > received from a majority; it then updates the commit index, and
> > > > > > the new entry is regarded as committed in the raft log. However,
> > > > > > the followers (including the one that initiated the request) are
> > > > > > not notified of this commit until the next heartbeat (ping
> > > > > > timeout) if there are no other pending requests. This results in
> > > > > > long latency for updates made through followers, especially when
> > > > > > a batch of updates is requested through the same follower.
> > > > >
> > > > > $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
> > > > >
> > > > > real    0m34.154s
> > > > > user    0m0.083s
> > > > > > sys     0m0.250s
> > > > >
> > > > > > This patch solves the problem by sending a heartbeat as soon as
> > > > > > the commit index is updated on the leader. It also avoids
> > > > > > unnecessary heartbeats by resetting the ping timer whenever an
> > > > > > AppendRequest is broadcast. With this patch, the same test runs
> > > > > > more than 50 times faster:
> > > > >
> > > > > $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
> > > > >
> > > > > real    0m0.564s
> > > > > user    0m0.080s
> > > > > > sys     0m0.199s
> > > > >
> > > > > > Some sleeps are added to the torture test cases because of the
> > > > > > improved performance; otherwise the tests would all be skipped.
> > > > >
> > > > > Signed-off-by: Han Zhou <[email protected]>
> > > > > ---
> > > > >
> > > > > Notes:
> > > > > >     v1->v2: adjust torture test case so that it passes without
> > > > > >     overloading the CPU.
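
For illustration, the latency effect described in the commit message above
can be reproduced with a toy model. This is only a sketch under stated
assumptions, not the OVS raft code: the class ToyLeader, its methods, and
the 100 ms ping interval are all hypothetical names and values chosen for
the example. It compares when a follower first sees the new commit index
with and without a heartbeat sent immediately after the commit.

```python
from dataclasses import dataclass, field

PING_INTERVAL_MS = 100  # hypothetical heartbeat period, not taken from the patch


@dataclass
class ToyLeader:
    """Toy model of a Raft leader; illustrative only, not the OVS code."""
    eager_heartbeat: bool
    commit_index: int = 0
    log_len: int = 0
    ping_deadline: int = PING_INTERVAL_MS
    sent: list = field(default_factory=list)  # (time_ms, commit index carried)

    def broadcast(self, now):
        # Every broadcast carries the leader's current commit index and acts
        # as a heartbeat, so the ping timer is restarted here.
        self.sent.append((now, self.commit_index))
        self.ping_deadline = now + PING_INTERVAL_MS

    def append(self, now):
        # A new entry is appended and replicated to the followers.
        self.log_len += 1
        self.broadcast(now)
        return self.log_len

    def on_majority_ack(self, now, index):
        # A majority has acknowledged the entry: it is now committed.
        if index > self.commit_index:
            self.commit_index = index
            if self.eager_heartbeat:
                self.broadcast(now)  # notify followers right away

    def tick(self, now):
        # Periodic heartbeat when nothing else was sent recently.
        if now >= self.ping_deadline:
            self.broadcast(now)


def run(eager):
    """Return the time (ms) at which followers first see the commit."""
    leader = ToyLeader(eager_heartbeat=eager)
    index = leader.append(now=10)
    leader.on_majority_ack(now=12, index=index)
    for now in range(13, 300):
        leader.tick(now)
    for sent_at, carried in leader.sent:
        if carried >= index:
            return sent_at
    return None


print("lazy :", run(eager=False), "ms")
print("eager:", run(eager=True), "ms")
```

In this model the lazy leader only announces the commit at the next ping
timeout, while the eager one announces it as soon as the majority replies;
the real numbers depend on the actual ping timeout and network latency.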
> > > >
> > > > With this patch, on my laptop, running test 2525 seems to always skip
> > > > it, with results similar to the following:
> > > >
> > > >     ## ------------------------------- ##
> > > >     ## openvswitch 2.11.90 test suite. ##
> > > >     ## ------------------------------- ##
> > > >     2525: OVSDB 3-server torture test - kill/restart leader skipped (ovsdb-cluster.at:198)
> > > >
> > > >     ## ------------- ##
> > > >     ## Test results. ##
> > > >     ## ------------- ##
> > > >
> > > >     0 tests were successful.
> > > >     1 test was skipped.
> > > >     make[3]: Leaving directory '/home/blp/nicira/ovs/_build'
> > > >     make[2]: Leaving directory '/home/blp/nicira/ovs/_build'
> > > >     make[1]: Leaving directory '/home/blp/nicira/ovs/_build'
> > > >
> > > >     real    0m9.194s
> > > >     user    0m3.693s
> > > >     sys     0m1.658s
> > > >     blp@sigill:~/nicira/ovs/_build(0)$
> > >
> > > Sorry to hear :(. It was pretty stable on my laptop - maybe my laptop
> > > is slower than yours :). I just sent V3 to make the test case more
> > > stable. I reduced the interval of the checking loop so that it can
> > > detect phase changes and trigger the operations asap. I ran all
> > > torture tests with -j1, -j5 and -j10. All cases passed without
> > > skipping. I hope it is stable on your laptop, too. Could you try
> > > again?
> >
> > I do tend to buy nice laptops; my current one has an i7-8565U.
> 
> Mine is i7-7920HQ, and I am running in a VM ...
> 
> >
> > This version (v3) does not skip the test and does not use excessive CPU.
> > Splendid.
> >
> > However, I am concerned that it makes the test a lot easier.
> > My design goal in this test was to try to provoke tons of races by
> > throwing many transactions at the server at once.  That is why it
> > invoked all of the ovn-sbctl calls without any "sleep"s.  By adding
> > sleeps, I think that the test becomes easier: aren't you basically
> > serializing all of the transactions?
> 
> Yes, the degree of parallelism is reduced by the sleeps. V1 tried
> to keep the parallelism simply by adding more clients, but that causes
> high CPU usage.

I assumed that the high CPU must be caused by some kind of busy-looping
bug.  Are you sure that it's really just due to a lot of clients?

> The error is: db_ctl_base|ERR|transaction error: {"details":"transact
> request specifies unknown database OVN_Southbound","error":"unknown
> database"
> It is triggered when a server is disconnected from the cluster but
> still communicates with clients. In fact, there are at least two
> problems:
> 1. When the client retried the connection, it did not pick another
> server, but connected to the same server.
> 2. I fixed 1) by calling jsonrpc_session_pick_remote() in
> jsonrpc_session_force_reconnect(), but the client still fails.
> I will need more time to debug and will submit separate patches, since
> this is not directly related to the current patch. Any hints are welcome.
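
As a rough illustration of problem 1 and the fix described in point 2, the
idea of rotating to a different server on a forced reconnect can be
sketched as below. This is a toy model only: ToySession and its methods are
hypothetical and do not reproduce the real jsonrpc session API or its
behavior, and the server names are placeholders.

```python
class ToySession:
    """Toy client session over several remotes; illustrative only."""

    def __init__(self, remotes):
        if not remotes:
            raise ValueError("at least one remote is required")
        self.remotes = list(remotes)
        self.current = 0  # index of the remote currently in use

    @property
    def remote(self):
        return self.remotes[self.current]

    def pick_remote(self):
        # Advance to the next remote in round-robin order.
        self.current = (self.current + 1) % len(self.remotes)

    def force_reconnect(self, pick_new=True):
        # With pick_new=False the session retries the same server, which
        # mirrors problem 1; with pick_new=True it moves to another one.
        if pick_new:
            self.pick_remote()
        return self.remote


session = ToySession(["server1", "server2", "server3"])
print(session.force_reconnect(pick_new=False))  # retries the same remote
print(session.force_reconnect())                # moves on to the next remote
```

Rotating remotes only helps if the next server is still part of the
cluster; as noted above, the client still fails after this change, so the
sketch covers the first problem only.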

I'm glad to hear about more bug fixes.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
