On Tue, Mar 19, 2019 at 6:42 PM Ben Pfaff <[email protected]> wrote:
>
> On Mon, Mar 18, 2019 at 05:02:15PM -0700, Han Zhou wrote:
> > On Mon, Mar 18, 2019 at 2:49 PM Ben Pfaff <[email protected]> wrote:
> > >
> > > On Fri, Mar 15, 2019 at 04:17:35PM -0700, Han Zhou wrote:
> > > > From: Han Zhou <[email protected]>
> > > >
> > > > When update is requested from follower, the leader sends AppendRequest
> > > > to all followers and wait until AppendReply received from majority, and
> > > > > then it will update commit index - the new entry is regarded as committed
> > > > in raft log. However, this commit will not be notified to followers
> > > > (including the one initiated the request) until next heartbeat (ping
> > > > timeout), if no other pending requests. This results in long latency
> > > > for updates made through followers, especially when a batch of updates
> > > > are requested through the same follower.
> > > >
> > > > $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
> > > >
> > > > real    0m34.154s
> > > > user    0m0.083s
> > > > > sys     0m0.250s
> > > >
> > > > This patch solves the problem by sending heartbeat as soon as the commit
> > > > > index is updated in leader. It also avoids unnecessary heartbeat by resetting
> > > > the ping timer whenever AppendRequest is broadcasted. With this patch
> > > > the performance is improved more than 50 times in same test:
> > > >
> > > > $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
> > > >
> > > > real    0m0.564s
> > > > user    0m0.080s
> > > > > sys     0m0.199s
> > > >
> > > > Some sleep is added in torture test cases because of the improved
> > > > performance, otherwise the tests will all be skipped.
> > > >
> > > > Signed-off-by: Han Zhou <[email protected]>
> > > > ---
> > > >
> > > > Notes:
> > > > >     v1->v2: adjust torture test case so that it passes without overload CPU.
> > >
> > > With this patch, on my laptop, running test 2525 seems to always skip
> > > it, with results similar to the following:
> > >
> > >     ## ------------------------------- ##
> > >     ## openvswitch 2.11.90 test suite. ##
> > >     ## ------------------------------- ##
> > > >     2525: OVSDB 3-server torture test - kill/restart leader skipped (ovsdb-cluster.at:198)
> > >
> > >     ## ------------- ##
> > >     ## Test results. ##
> > >     ## ------------- ##
> > >
> > >     0 tests were successful.
> > >     1 test was skipped.
> > >     make[3]: Leaving directory '/home/blp/nicira/ovs/_build'
> > >     make[2]: Leaving directory '/home/blp/nicira/ovs/_build'
> > >     make[1]: Leaving directory '/home/blp/nicira/ovs/_build'
> > >
> > >     real    0m9.194s
> > >     user    0m3.693s
> > >     sys     0m1.658s
> > >     blp@sigill:~/nicira/ovs/_build(0)$
> >
> > Sorry to hear :(. It was pretty stable on my laptop - maybe my laptop
> > is slower than yours :). I just sent V3 to make the test case more
> > stable. I reduced the interval of the checking loop so that it can
> > detect phase changes and trigger the operations asap. I ran all
> > torture tests with -j1, -j5 and -j10. All cases passed without
> > skipping. I hope it is stable on your laptop, too. Could you try
> > again?
>
> I do tend to buy nice laptops, current one is i7-8565U.

Mine is i7-7920HQ, and I am running in a VM ...

>
> This version (v3) does not skip the test and does not use excessive CPU.
> Splendid.
>
> However, I am concerned that it makes the test a lot easier.
> My design goal in this test was to try to provoke tons of races by
> throwing many transactions at the server at once.  That is why it
> invoked all of the ovn-sbctl calls without any "sleep"s.  By adding
> sleeps, I think that the test becomes easier: aren't you basically
> serializing all of the transactions?

Yes, the degree of parallelism is reduced by the sleeps. V1 tried to
keep the parallelism simply by adding more clients, but that caused
high CPU usage.
I just sent V4, which keeps the original parallelism without much
extra CPU cost by increasing the size of each transaction. I ran it
several times and it never skipped a test. In fact it seems more
effective than before, because the test now fails more frequently.
The failure is not caused by this patch but by an existing bug: I had
noticed it earlier (and was planning to debug it, but haven't had
time yet).
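
To illustrate what "increasing the size of each transaction" means,
here is a rough Python sketch that builds one ovn-sbctl command line
carrying several operations separated by "--", so that they go out as
a single transaction. The helper name, the key naming scheme and the
choice of "add SB_Global . external_ids" as the operation are
assumptions made for illustration; the actual commands in the V4 test
may differ.

```python
def build_sbctl_argv(client_id, ops_per_txn):
    """Build an ovn-sbctl argv that batches several operations.

    Hypothetical helper: operations are joined with "--" so that the
    tool submits them together as one transaction.
    """
    if ops_per_txn < 1:
        raise ValueError("ops_per_txn must be at least 1")
    argv = ["ovn-sbctl", "--timeout=120"]
    for n in range(ops_per_txn):
        key = f"c{client_id}-{n}"  # illustrative unique key per operation
        if n > 0:
            argv.append("--")      # separator between batched commands
        argv += ["add", "SB_Global", ".", "external_ids", f"{key}={key}"]
    return argv


if __name__ == "__main__":
    print(" ".join(build_sbctl_argv(client_id=1, ops_per_txn=3)))
```

With this shape the number of concurrent clients stays the same, and
only the amount of work per transaction grows.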

The error is: db_ctl_base|ERR|transaction error: {"details":"transact
request specifies unknown database OVN_Southbound","error":"unknown
database"
It is triggered when a server is disconnected from the cluster but
still communicates with clients. In fact, there are at least two
problems:
1. When the client retries the connection, it does not pick another
server but reconnects to the same one.
2. I fixed 1) by calling jsonrpc_session_pick_remote() in
jsonrpc_session_force_reconnect(), but the client still fails.
I will need more time to debug this and will submit separate patches,
since it is not directly related to the current patch. Any hints are
welcome.
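
To make problem 1 concrete, here is a toy Python model of a client
session that rotates to the next remote on a forced reconnect. It is
only a sketch of the idea, not the OVS C code: the class, its methods
and the "rotate" flag are made up for illustration.

```python
class Session:
    """Toy model of a client session with several candidate servers."""

    def __init__(self, remotes):
        if not remotes:
            raise ValueError("at least one remote is required")
        self.remotes = list(remotes)
        self.index = 0  # position of the remote currently in use

    @property
    def current(self):
        return self.remotes[self.index]

    def pick_remote(self):
        """Advance to the next remote, wrapping around at the end."""
        self.index = (self.index + 1) % len(self.remotes)
        return self.current

    def force_reconnect(self, rotate=True):
        """Reconnect; with rotate=True, move to another remote first."""
        if rotate:
            self.pick_remote()
        return self.current


if __name__ == "__main__":
    s = Session(["tcp:10.0.0.1:6642", "tcp:10.0.0.2:6642", "tcp:10.0.0.3:6642"])
    print(s.force_reconnect(rotate=False))  # stays on the same server
    print(s.force_reconnect())              # moves on to the next server
```

With rotate=False the client keeps coming back to the same
(disconnected) server, which is the behavior described in 1. As noted
in 2, rotating alone was not enough to make the client succeed.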

Thanks,
Han
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
