I have been debugging OVSDB Raft scale-testing problems recently, and
here is a major finding. I thought there were several different
problems, but it turned out that the only major one is the
prerequisite-checking mechanism. My conclusion is that we'd better
abandon the prerequisite checking and switch to the alternative
approach Ben mentioned at an earlier OVN meeting: the follower sends
the operations requested by clients directly to the leader, without
any prerequisite assumption. Here is the detailed analysis.

The test environment is ovn-scale-test, with 3-node clusters for NB
and SB. There are 1k HVs connecting to the SB DB, with connections
randomly distributed across the 3 cluster nodes. The test creates 10k
lports and binds them on the 1k HVs. The bottleneck is in the SB DB
only, because of the large number of clients and the continuous write
requests (port-binding updates) from those clients.

The first problem I encountered was that leader re-election happened a
lot during the testing, because the leader wasn't able to respond to
heartbeat messages in time due to CPU load. The ideal solution would
be a dedicated thread for Raft message handling, but to circumvent the
problem and see what happens, I increased the election base time from
1024 ms to 10000 ms. Heartbeat messages are sent at 1/3 of the
expiration time, so they now go out roughly every 3+ seconds.
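The timer arithmetic above can be sketched as follows (a toy
illustration, assuming heartbeats fire at exactly 1/3 of the election
base time as described; the function name is mine, not OVSDB's):

```python
# Hedged sketch: heartbeat interval derived from the election base
# time, assuming the 1/3 ratio described above.
def heartbeat_interval_ms(election_base_ms):
    return election_base_ms // 3

print(heartbeat_interval_ms(1024))   # default: ~341 ms
print(heartbeat_interval_ms(10000))  # increased: ~3333 ms, i.e. 3+ sec
```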

With that change, the leader re-election problem was solved. However,
it revealed the real reason the CPU load was high: CPU stayed at 100%
on the leader, and the rate of INFO logs about prerequisite-check
failures was extremely high. For example:
2019-02-20T19:27:02.001Z|00382|raft|INFO|Dropped 299433 log messages
in last 12 seconds (most recently, 0 seconds ago) due to excessive
rate
2019-02-20T19:27:02.001Z|00383|raft|INFO|current entry eid
b4b77aaf-4f69-449b-a689-c6c2a02f452f does not match prerequisite
9477ce8d-1b8f-46af-8ab4-0ea044253742 in execute_command_request
2019-02-20T19:27:14.001Z|00388|raft|INFO|Dropped 300646 log messages
in last 12 seconds (most recently, 0 seconds ago) due to excessive
rate
2019-02-20T19:27:14.001Z|00389|raft|INFO|current entry eid
c062d334-c6b8-4e8d-9ae8-38ed650c120b does not match prerequisite
a87df6e0-661d-404c-8d7f-ad6de210cc41 in execute_command_request
2019-02-20T19:27:26.001Z|00394|raft|INFO|Dropped 301084 log messages
in last 12 seconds (most recently, 0 seconds ago) due to excessive
rate
2019-02-20T19:27:26.001Z|00395|raft|INFO|current entry eid
0f74e572-777e-4d6e-859a-739fac0293e5 does not match prerequisite
def7f61f-f865-423b-a735-5c3c78d982e8 in execute_command_request

This is roughly 25K failures per second (~300K dropped log messages
per 12-second window), caused by prereq-check failures and continuous
retries from the followers.
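The rate can be read straight off the rate-limiter lines quoted above
(the regex below targets that exact message format; this is just a
back-of-the-envelope check, not OVS code):

```python
# Hedged sketch: estimate the prereq-failure rate from one of the
# "Dropped N log messages in last M seconds" lines shown above.
import re

line = ("2019-02-20T19:27:14.001Z|00388|raft|INFO|Dropped 300646 log "
        "messages in last 12 seconds (most recently, 0 seconds ago) "
        "due to excessive rate")
m = re.search(r"Dropped (\d+) log messages in last (\d+) seconds", line)
dropped, window = int(m.group(1)), int(m.group(2))
print(dropped // window)  # ~25K dropped messages per second
```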

My first reaction was to avoid the continuous retries from followers
by checking whether the DB's prereq had changed, since it makes no
sense to retry with the same prereq as the previous failed attempt. I
made the change and tried again. It did solve the CPU problem, but
revealed two new problems.
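That first workaround amounts to a simple guard on the retry path,
which can be sketched like this (illustrative names and structure,
not the actual OVSDB code):

```python
# Hedged sketch of the first workaround: only retry a failed command
# once the follower's prereq has advanced past the one that failed.
class CommandRetry:
    def __init__(self):
        self.last_failed_prereq = None

    def should_retry(self, current_prereq):
        # Retrying with the same prereq as the failed attempt is
        # guaranteed to fail again, so wait until it changes.
        return current_prereq != self.last_failed_prereq

    def record_failure(self, prereq):
        self.last_failed_prereq = prereq

r = CommandRetry()
r.record_failure("eid-1")
print(r.should_retry("eid-1"))  # False: same prereq, would fail again
print(r.should_retry("eid-2"))  # True: prereq has advanced
```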

Problem 1):
There were still hundreds of prerequisite-check failures per second in
the leader's logs, and the whole scale test was extremely slow, just
as in the earlier high-CPU situation (6 hours to complete the whole
test). By checking the Raft messages in debug logs, I found that this
is because so many clients request updates to the DB that each
follower can pile up a batch of transactions from different clients
and then send them to the leader one by one, all with *the same
prereq*. In this case, even with no racing from other nodes, only the
first request from a given follower succeeds; the later ones are
certain to fail the prereq check, because the first transaction
proposed by that follower causes the leader to update its expected
prereq. I tried another patch to solve this: update the follower's
"prereq" to the "result", i.e. the eid of the last proposed commit, so
that the *batch* of transactions proposed by the same follower would
either all succeed if there is no racing from other nodes, or all fail
if there is. This patch worked, and the whole test completed in 2
hours, similar to the result with a standalone DB. However, I realized
that this trick violates basic transactional correctness. A proposed
commit should be based on a previously committed version; if the
previous proposed transaction in the *batch* is not yet committed,
setting its eid as the prereq is wrong. Going further, the attempt to
propose a batch of transactions seems wrong in itself: each
transaction should be examined on top of previously committed
transactions (unless we do concurrency control such as MVCC, which is
obviously not the case in the current OVSDB design). So, to ensure
correctness, the prereq has to be the last applied eid on the follower
side, as implemented today.
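The batching pathology in problem 1) can be shown with a toy model (a
hedged sketch of the behavior described above, not OVSDB code): the
leader accepts a command only if its prereq matches the leader's
current entry eid, and its expected eid advances after each accepted
command, so only the first command of a same-prereq batch can succeed.

```python
# Hedged toy model of problem 1): a follower's batch of commands all
# carry the same prereq, so only the first one passes the check.
def leader_handle_batch(leader_eid, commands):
    results = []
    for prereq, new_eid in commands:
        if prereq == leader_eid:
            leader_eid = new_eid  # the leader's expected eid advances
            results.append("success")
        else:
            results.append("prereq mismatch")
    return results

# Three transactions piled up in one main-loop iteration, all using
# the follower's last applied eid "e0" as prereq:
batch = [("e0", "e1"), ("e0", "e2"), ("e0", "e3")]
print(leader_handle_batch("e0", batch))
# ['success', 'prereq mismatch', 'prereq mismatch']
```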

Because of this, the only optimization we can make to the
prereq-checking mechanism is probably to propose commits to the leader
only when: a) the raft applied index matches the latest raft eid in
the follower's raft log, because otherwise new proposals from other
nodes have already happened, and it would be a waste of CPU to propose
a commit that is bound to fail the prereq check; and b) there is no
pending proposal from the current follower node already sent to the
leader with the same prereq, because otherwise the following proposal
would fail the prereq check, too.
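Conditions a) and b) amount to a gating predicate on the follower's
send path, roughly like the following (names are illustrative; this
is a sketch of the idea, not a proposed patch):

```python
# Hedged sketch of the gating conditions a) and b) above: a follower
# only proposes a commit when its applied index has caught up with
# its local log and no same-prereq proposal is already in flight.
def may_propose(applied_eid, log_end_eid, in_flight_prereqs, prereq):
    # a) applied index matches the latest eid in the local raft log
    if applied_eid != log_end_eid:
        return False
    # b) no pending proposal already sent with the same prereq
    if prereq in in_flight_prereqs:
        return False
    return True

print(may_propose("e5", "e5", set(), "e5"))   # True: safe to propose
print(may_propose("e5", "e6", set(), "e5"))   # False: log is ahead
print(may_propose("e5", "e5", {"e5"}, "e5"))  # False: already in flight
```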

However, while this optimization can solve the CPU-load problem, it
would not solve the performance problem. The scale test would still be
slow, due to problem 2) described below.

Problem 2):
There is latency introduced by the prereq-check mechanism. Take a
simple scenario as an example: assume there is no racing with other
nodes, and follower A is currently the only node proposing commits.
Assume 3 transactions T1 - T3 have accumulated in the current
main-loop iteration and are to be sent to the leader. The Raft message
sequence would be as follows:

1. Follower A: sends CommandExec(T1) using prereq P1.
2. Leader: handles T1, sends AppendReq(T1) to all followers.
3. Follower A: receives AppendReq(T1), updates its local log_end (but
not commit_index or last_applied), and replies to the leader.
4. Leader: receives replies from a majority, replies success to
follower A for CommandExec(T1), with the new commit_index.
5. Follower A: handles the reply for CommandExec(T1). However, it does
nothing but some cleanup; the commit_index in the reply appears
unused. (This is correct, since Raft doesn't suggest any action at
this point.)
6. ... 1/3 heartbeat timer reached ...
7. Leader: sends AppendReq(heartbeat) with the new commit_index to
follower A.
8. Follower A: handles AppendReq(heartbeat) and updates its local
commit_index; now follower A can apply T1 to its local committed data,
and T2 can be proposed to the leader with the new prereq P2.
9. Follower A: sends CommandExec(T2) using prereq P2.
...
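The cost of this sequence can be put in rough numbers (a hedged
back-of-the-envelope model, not a measurement: it assumes each
transaction in the batch serializes behind one full heartbeat
interval, and uses integer milliseconds for simplicity):

```python
# Hedged latency model of problem 2): each transaction in a
# follower's batch waits for the next heartbeat (sent at 1/3 of the
# election timer) before the next one can be proposed.
def batch_latency_ms(n_txns, election_timer_ms):
    heartbeat_ms = election_timer_ms // 3
    # transactions execute one by one, each behind one heartbeat
    return n_txns * heartbeat_ms

print(batch_latency_ms(3, 10000))  # T1..T3 with a 10 s timer: 9999 ms
print(batch_latency_ms(3, 1024))   # with the default timer: 1023 ms
```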

As we can see, just to complete T1 there is latency caused by the
heartbeat timer at step 6, and T2 cannot start until step 8 (it can
start earlier, as currently implemented, but that only results in
prereq-check failures and useless retries that waste more CPU cycles).
So the batch of transactions on follower A has to be executed one by
one, each waiting at least one heartbeat interval (1/3 of the election
timer). This causes big latencies and very low performance overall. An
optimization could be made at step 5, when the follower receives the
CommandExec reply, to advance the local commit_index without waiting
for the next heartbeat. However, I am not sure whether that introduces
any problem, since it is not something mentioned in the Raft paper.

Because of the above problems, I believe the alternative approach
would be better than the prereq-check mechanism and would solve the
problems easily. Instead of sending the final result to the leader,
followers can simply forward the "transaction" RPC from the client to
the leader, and let the leader perform ovsdb_execute_compose() and
ovsdb_txn_precommit() on top of the leader's latest data. This way,
both problems are solved:

For problem 1), since the leader does the real transaction parsing and
execution, the prerequisite is not needed; in fact there is no defined
order among the transactions forwarded by followers, since they
originally came from different clients, so the leader is free to apply
them in any order it considers consistent. For problem 2), a batch of
requests can be sent to the leader without delay. Only the final
commit_index update for the last transaction in the batch will
experience the delay caused by the heartbeat timer (similar to step 6
above), which seems to be an inherent property of the Raft design.
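The forwarding idea can be sketched as follows. Everything here is a
placeholder model of the proposal, not OVSDB code: the class and
method names (e.g. execute_and_commit) are mine, standing in for the
leader-side ovsdb_execute_compose()/ovsdb_txn_precommit() path.

```python
# Hedged sketch of the alternative approach: the follower forwards
# the client's raw transaction request to the leader instead of
# executing it locally and proposing the result with a prereq.
class Leader:
    def __init__(self):
        self.log = []

    def execute_and_commit(self, raw_txn):
        # The leader parses and executes the transaction on top of
        # its own latest data, so no prerequisite check is needed,
        # and it may order concurrent forwarded transactions freely.
        self.log.append(raw_txn)
        return "committed"

def follower_handle_client_txn(raw_txn, leader):
    # No local execution, no prereq: just forward the client's RPC.
    return leader.execute_and_commit(raw_txn)

leader = Leader()
print(follower_handle_client_txn({"op": "insert"}, leader))  # committed
print(leader.log)  # [{'op': 'insert'}]
```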

There are side effects to this alternative approach. The leader will
do a little more work than before, but compared with the CPU wasted on
handling useless failure retries, this is nothing. Even if the
previously mentioned optimizations were implemented to avoid the CPU
waste in the prereq mechanism, the alternative approach would still
beat it on overall performance, because there is no delay in sending
requests to the leader.

Ben, please correct me if I misunderstood anything and suggest the next steps.

Thanks,
Han
_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
