On Fri, Mar 1, 2019 at 11:13 AM Han Zhou <[email protected]> wrote:
>
> On Tue, Feb 26, 2019 at 1:57 PM Han Zhou <[email protected]> wrote:
> >
> > I have been debugging OVSDB Raft scale-testing problems recently, and here is a major finding. I thought there were several different problems, but it turned out that the only major one is the prerequisite checking mechanism. My conclusion is that we had better abandon prerequisite checking and switch to the alternative approach Ben mentioned at an earlier OVN meeting: the follower sends the operations requested by clients directly to the leader, without any prerequisite assumption. Here is the detailed analysis.
> >
> > The test environment is ovn-scale-test, with 3-node clusters for NB and SB. There are 1k HVs connecting to the SB DB, with connections randomly distributed across the 3 cluster nodes. The test creates 10k lports and binds them on the 1k HVs. The bottleneck is in the SB DB only, because of the large number of clients and their continuous write requests (port-binding updates).
> >
> > The first problem I encountered was that leader re-election happened a lot during the testing, because the leader wasn't able to respond to heartbeat messages in time due to CPU load. The ideal solution would be a dedicated thread for Raft message handling, but to work around the problem and see what happens, I increased the election base time from 1024 ms to 10000 ms. Heartbeat messages are sent at 1/3 of the expiration time, so they now go out roughly every 3+ seconds.
> >
> > With that change the leader re-election problem was solved. However, it revealed the real reason the CPU load was high: CPU usage stayed at 100% on the leader, and the rate of INFO logs about prerequisite check failures was extremely high.
> > For example:
> >
> > 2019-02-20T19:27:02.001Z|00382|raft|INFO|Dropped 299433 log messages in last 12 seconds (most recently, 0 seconds ago) due to excessive rate
> > 2019-02-20T19:27:02.001Z|00383|raft|INFO|current entry eid b4b77aaf-4f69-449b-a689-c6c2a02f452f does not match prerequisite 9477ce8d-1b8f-46af-8ab4-0ea044253742 in execute_command_request
> > 2019-02-20T19:27:14.001Z|00388|raft|INFO|Dropped 300646 log messages in last 12 seconds (most recently, 0 seconds ago) due to excessive rate
> > 2019-02-20T19:27:14.001Z|00389|raft|INFO|current entry eid c062d334-c6b8-4e8d-9ae8-38ed650c120b does not match prerequisite a87df6e0-661d-404c-8d7f-ad6de210cc41 in execute_command_request
> > 2019-02-20T19:27:26.001Z|00394|raft|INFO|Dropped 301084 log messages in last 12 seconds (most recently, 0 seconds ago) due to excessive rate
> > 2019-02-20T19:27:26.001Z|00395|raft|INFO|current entry eid 0f74e572-777e-4d6e-859a-739fac0293e5 does not match prerequisite def7f61f-f865-423b-a735-5c3c78d982e8 in execute_command_request
> >
> > That is ~30k failures per second, caused by prereq check failures and continuous retries from the followers.
> >
> > My first reaction was to avoid the continuous retries from followers by checking whether the DB's prereq had changed, since it makes no sense to retry with the same prereq as the previous failed attempt. I made that change and tried again. It did solve the CPU problem, but it revealed two new problems.
> >
> > Problem 1):
> > There were still hundreds of prerequisite check failures per second in the leader's logs, and the whole scale test run was extremely slow, just as in the previous high-CPU situation (6 hours to complete the whole test).
> > By checking Raft messages in the debug logs, I found that this is because so many clients request DB updates that each follower can pile up a batch of transactions from different clients and then send them one by one to the leader, all with *the same prereq*. In this case, even with no racing from other nodes, only the first request from a given follower can succeed; the later ones are certain to fail the prereq check, because the first transaction proposed by that follower causes the leader to update its expected prereq. I tried another patch to solve this, updating the follower's "prereq" to the "result" - the eid of the last proposed commit - so that the *batch* of transactions proposed by the same follower would either all succeed if there is no racing from other nodes, or all fail if there is. This patch worked, and the whole test completed in 2 hours, similar to the result with a standalone DB. However, I realized that this trick violates basic transactional correctness. A proposed commit should be based on a previously *committed* version; if the previous proposed transaction in the *batch* is not yet in the committed state, setting its eid as the prereq is wrong. Going further, the attempt to propose a batch of transactions seems to be wrong in itself - each transaction should be examined on top of previously committed transactions (unless we are doing concurrency control such as MVCC, which is obviously not the case in the current OVSDB design). So, to ensure correctness, the prereq has to be the last applied eid on the follower side, as is implemented today.
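The batch failure mode described above can be modeled with a small sketch. Everything here is illustrative - a toy model, not the actual ovsdb-server code: the leader accepts a proposal only when its prereq eid equals the eid of the leader's latest entry, so a follower reusing its last *applied* eid for a whole batch can succeed only on the first transaction.

```python
import uuid

def propose_batch(batch_size):
    """Toy model (not the real ovsdb-server code) of a follower sending a
    batch of transactions that all carry the same prereq: the eid of the
    last transaction the follower has *applied*."""
    leader_eid = uuid.uuid4()       # eid of the leader's latest log entry
    follower_prereq = leader_eid    # follower starts fully caught up
    results = []
    for _ in range(batch_size):
        if follower_prereq == leader_eid:
            leader_eid = uuid.uuid4()   # accepted: new entry becomes latest
            results.append("ok")
        else:
            results.append("prereq mismatch")
        # follower_prereq is NOT advanced: the follower has not applied
        # the new entry yet, so every later proposal reuses the old eid.
    return results

# Only the first transaction of the batch passes the prereq check.
assert propose_batch(4) == ["ok"] + ["prereq mismatch"] * 3
```

If the follower instead advanced its prereq to the eid of its own previous, still-uncommitted proposal, the whole batch would pass - which is exactly the trick described above that violates transactional correctness.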
> > Because of this, the only optimization we can do with the prereq checking mechanism is probably to propose commits to the leader only when:
> > a) the raft applied index matches the latest raft eid in the follower's raft log - because otherwise new proposals have already happened from other nodes, and it would just waste CPU to propose a commit that is bound to fail the prereq check; and
> > b) there is no pending proposal from the current follower node already sent to the leader with the same prereq - because otherwise the following proposal would fail the prereq check, too.
>
> I submitted a patch for a), which is low-hanging fruit. b) is not difficult to do, but it doesn't seem to provide much value, since a) already solves the CPU problem (but not the latency - "problem 2" described below).
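Conditions a) and b) amount to a simple gate applied before sending each proposal. This is only a sketch of the idea with made-up names, not code from the patch:

```python
def should_propose(applied_eid, log_end_eid, pending_prereqs, prereq):
    """Sketch of the two gating conditions above (illustrative names,
    not the actual patch): only send a proposal to the leader when it
    has a realistic chance of passing the prereq check."""
    if applied_eid != log_end_eid:   # a) other proposals are already in flight
        return False
    if prereq in pending_prereqs:    # b) we already proposed with this prereq
        return False
    return True

assert should_propose("e1", "e1", set(), "e1")       # safe to propose
assert not should_propose("e1", "e2", set(), "e1")   # a) would fail prereq
assert not should_propose("e1", "e1", {"e1"}, "e1")  # b) duplicate in flight
```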
Sorry, I forgot to paste the link to the patch: https://mail.openvswitch.org/pipermail/ovs-dev/2019-March/356891.html

> > However, while this optimization solves the CPU load problem, it would not solve the performance problem. The scale test would still be slow, due to problem 2) described below.
> >
> > Problem 2):
> > There is a latency introduced by the prereq check mechanism. Take the simplest scenario as an example: assume there is no racing with other nodes, and follower A is currently the only node proposing commits. Assume there are 3 transactions T1 - T3 accumulated in the current main loop iteration to be sent to the leader. The Raft message sequence would be like the following:
> >
> > 1. Follower A: send CommandExec(T1) using prereq P1.
> > 2. Leader: handle T1, send AppendReq(T1) to all followers.
> > 3. Follower A: receive AppendReq(T1), update the local log_end (but not commit_index or last_applied), reply to the leader.
> > 4. Leader: receive replies from the majority, reply success to follower A for CommandExec(T1), with the new commit_index.
> > 5. Follower A: handle the reply for CommandExec(T1). However, it does nothing but some cleanup; the commit_index in the reply seems unused. (This is correct, since Raft doesn't suggest any action at this point.)
> > 6. ... the heartbeat timer (1/3 of the election timer) fires ...
> > 7. Leader: send AppendReq(heartbeat) with the new commit_index to follower A.
> > 8. Follower A: handle AppendReq(heartbeat) and update the local commit_index. Now follower A can apply T1 to its local committed data, and T2 can be proposed to the leader with the new prereq P2.
> > 9. Follower A: send CommandExec(T2) using prereq P2.
> > ...
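The serialized waits in the sequence above add up quickly. A back-of-the-envelope model, using the 10000 ms election base time from the test; the assumption that each transaction waits a full heartbeat interval is a worst case, not a measurement:

```python
# Back-of-the-envelope latency model for the sequence above.  The
# numbers come from the test setup described earlier; the assumption
# that each transaction waits a full heartbeat interval is a worst case.
election_base_ms = 10_000             # election base time used in the test
heartbeat_ms = election_base_ms / 3   # heartbeats are sent at 1/3 of it
batch = 3                             # T1..T3 from one main-loop iteration

# Prereq checking serializes the batch: each transaction waits for the
# next heartbeat (step 6) before the next one can be proposed (step 9).
serialized_ms = batch * heartbeat_ms

# Forwarding transactions to the leader would let the whole batch go out
# at once; only the final commit_index update waits for one heartbeat.
forwarded_ms = heartbeat_ms

assert round(serialized_ms) == 10_000            # ~10 s for just 3 txns
assert round(serialized_ms / forwarded_ms) == 3  # batch-size-fold slowdown
```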
> > As we can see, just to complete T1 there is a latency caused by the heartbeat timer at step 6, and T2 cannot start until step 8 (it can start earlier, as currently implemented, but that only results in prereq check failures and useless retries that waste more CPU cycles). So the batch of transactions on follower A has to be executed one by one, each waiting for the next heartbeat, i.e. on the order of 1/3 of the election timer. This causes big latencies and very low overall performance. An optimization could be done at step 5, when the follower receives the CommandExec reply, to move the local commit_index forward without waiting for the next heartbeat. However, I am not sure whether that introduces any problem, since it is not something mentioned in the Raft paper.
> >
> > Because of the above problems, I believe the alternative approach would be better than the prereq check mechanism and would solve these problems easily. Instead of sending the final result to the leader, followers can simply forward the "transact" RPC from the client to the leader, and let the leader perform ovsdb_execute_compose() and ovsdb_txn_precommit() on top of its own latest data. This way, both problems are solved:
> >
> > For problem 1), since the leader does the real transaction parsing and execution, the prerequisite is not needed - in fact, there is no defined order among the transactions forwarded by followers, since they originally came from different clients, so the leader is free to apply them in any order it considers consistent. For problem 2), a batch of requests can be sent to the leader without delay. Only the final commit_index update for the last transaction in the batch will experience the delay caused by the heartbeat timer (similar to step 6 above), which seems to be an inherent problem of the Raft design.
> >
> > There are side-effects of this alternative approach.
> > The leader will do a little more work than before, but compared with the CPU wasted on handling the useless failure retries, this is nothing. Even if the previously mentioned optimizations were implemented for the prereq mechanism to avoid the CPU waste, the alternative approach would still beat it, because of the overall performance advantage (there is no delay in sending requests to the leader).
> >
> > Ben, please correct me if I misunderstood anything, and suggest the next steps.
> >
> > Thanks,
> > Han

_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
