On Tue, Feb 26, 2019 at 1:57 PM Han Zhou <[email protected]> wrote:

I have been working on debugging OVSDB Raft scale testing problems recently, and here is a major finding. I thought there were several different problems, but it turned out that the only major problem is the prerequisite checking mechanism. My conclusion is that we'd better abandon the prerequisite checking and change to the alternative approach Ben mentioned at an earlier OVN meeting: the follower sends the operations requested by clients directly to the leader, without any prerequisite assumption. Here is the detailed analysis.

The test environment is ovn-scale-test, with 3-node clusters for the NB and SB DBs. There are 1k HVs connecting to the SB DB, with connections randomly distributed across the 3 cluster nodes. The test creates 10k lports and binds them on the 1k HVs. The bottleneck is in the SB DB only, because of the large number of clients and the continuous write requests from those clients (port-binding updates).

The first problem I encountered was that leader re-election happened a lot during the testing, because the leader wasn't able to respond to heartbeat messages in time due to CPU load. The ideal solution would be a dedicated thread for Raft message handling, but to circumvent the problem and see what happens, I increased the election base time from 1024 ms to 10000 ms. Heartbeat messages are sent at 1/3 of the expiry time, so they now go out roughly every 3+ seconds.

With that change the leader re-election problem was solved. However, it revealed the real reason the CPU load was high: CPU usage stayed at 100% on the leader, and the rate of INFO logs about prerequisite check failures was extremely high.
For example:

2019-02-20T19:27:02.001Z|00382|raft|INFO|Dropped 299433 log messages in last 12 seconds (most recently, 0 seconds ago) due to excessive rate
2019-02-20T19:27:02.001Z|00383|raft|INFO|current entry eid b4b77aaf-4f69-449b-a689-c6c2a02f452f does not match prerequisite 9477ce8d-1b8f-46af-8ab4-0ea044253742 in execute_command_request
2019-02-20T19:27:14.001Z|00388|raft|INFO|Dropped 300646 log messages in last 12 seconds (most recently, 0 seconds ago) due to excessive rate
2019-02-20T19:27:14.001Z|00389|raft|INFO|current entry eid c062d334-c6b8-4e8d-9ae8-38ed650c120b does not match prerequisite a87df6e0-661d-404c-8d7f-ad6de210cc41 in execute_command_request
2019-02-20T19:27:26.001Z|00394|raft|INFO|Dropped 301084 log messages in last 12 seconds (most recently, 0 seconds ago) due to excessive rate
2019-02-20T19:27:26.001Z|00395|raft|INFO|current entry eid 0f74e572-777e-4d6e-859a-739fac0293e5 does not match prerequisite def7f61f-f865-423b-a735-5c3c78d982e8 in execute_command_request

That is roughly 25K failures per second (~300K dropped messages every 12 seconds), caused by the prereq check failures and the continuous retries from the followers.

My first reaction was to avoid the continuous retries from followers by checking whether the DB's prereq has changed, since it makes no sense to retry with the same prereq as the previous failed attempt. I made the change and tried again. It did solve the CPU problem, but it revealed two new problems.

Problem 1):
There were still hundreds of prerequisite check failures per second in the leader's logs, and the whole scale test execution was extremely slow, just as in the previous high-CPU-load situation (6 hours to complete the whole test).
By checking the Raft messages in debug logs, I found that this is because there are so many clients requesting DB updates that each follower can pile up a batch of transactions from different clients and then send them one by one to the leader, all with *the same prereq*. In this case, even without racing with other nodes, only the first request from a given follower succeeds; the later ones are certain to fail the prereq check, because the first transaction proposed by that follower causes the leader to update its expected prereq. I tried another patch to solve this: updating the follower's "prereq" to the "result" (the eid of the last proposed commit), so that the *batch* of transactions proposed by the same follower either all succeed, if there is no racing from other nodes, or all fail if there is racing. This patch worked, and the whole test completed in 2 hours, similar to the result with a standalone DB. However, I realized that this trick violates basic transactional correctness. A proposed commit should be based on a previously *committed* version; if the previous proposed transaction in the *batch* is not yet committed, setting its eid as the prereq is wrong. Going further, the attempt to propose a batch of transactions seems wrong in general: each transaction should be examined on top of previously committed transactions (unless we are doing concurrency control such as MVCC, which is obviously not the case in the current OVSDB design). So the prereq has to be the last applied eid on the follower side (as implemented today) to ensure correctness.
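The batching failure mode above can be illustrated with a toy model (names like Leader and execute_command are mine, not the actual OVSDB code): the leader accepts a command only if its prereq matches the latest entry's eid, and every accepted command creates a new eid, so a batch sent with one shared prereq can only succeed once.

```python
import uuid

class Leader:
    """Toy model of the leader's prerequisite check in execute_command_request."""

    def __init__(self, current_eid):
        self.current_eid = current_eid

    def execute_command(self, prereq):
        # Accept the proposal only if it is based on the latest entry.
        if prereq != self.current_eid:
            return False  # "does not match prerequisite" -> dropped and retried
        # Appending the new entry gives the log a new latest eid.
        self.current_eid = uuid.uuid4()
        return True

# A follower piles up three transactions in one main-loop iteration and
# sends them all with the same prereq (its last applied eid).
last_applied = uuid.uuid4()
leader = Leader(last_applied)
results = [leader.execute_command(prereq=last_applied) for _ in range(3)]
print(results)  # [True, False, False] -- only the first proposal survives
```

Even with no competing nodes, two of the three proposals fail and must be retried, which is where the ~25K failures per second come from at scale.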
Because of this, the only optimization we can do with the prereq checking mechanism is probably to propose commits to the leader only when:

a) the Raft applied index is the same as the latest Raft eid in the follower's log, because otherwise new proposals from other nodes have already happened and it would just waste CPU to propose a commit that is bound to fail the prereq check; and

b) there is no pending proposal from the current follower node already sent to the leader with the same prereq, because otherwise the following proposal would fail the prereq check, too.
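Conditions a) and b) can be sketched as a gate the follower would evaluate before sending a proposal. This is a simplified illustration with hypothetical names, not the actual OVSDB API:

```python
def should_propose(applied_index, log_end_index, pending_prereqs, prereq):
    """Hypothetical gate a follower could apply before sending CommandExec."""
    # a) Only propose when the local log is fully applied; otherwise entries
    #    from other nodes are already ahead and the prereq check must fail.
    if applied_index != log_end_index:
        return False
    # b) Don't send a second proposal carrying a prereq that is already in
    #    flight; the leader would reject it for exactly the same reason.
    if prereq in pending_prereqs:
        return False
    return True

# Caught up and nothing in flight: OK to propose.
ok = should_propose(7, 7, set(), "eid-7")
# Unapplied entries in the log (condition a): hold the proposal.
held_a = should_propose(7, 9, set(), "eid-7")
# A proposal with the same prereq is already pending (condition b).
held_b = should_propose(7, 7, {"eid-7"}, "eid-7")
print(ok, held_a, held_b)  # True False False
```

Either condition failing means the leader would reject the proposal anyway, so holding it back only saves wasted round trips and CPU.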
I submitted a patch for a), which is low-hanging fruit. b) is not difficult to do, but it seems to provide little value, since a) already solves the CPU problem (though not the latency; see "problem 2" below).

However, while this optimization can solve the CPU load problem, it would not solve the performance problem. The scale test would still be slow, due to problem 2) described below.

Problem 2):
There is a latency introduced by the prereq check mechanism. Take the simple scenario as an example: assume there is no racing with other nodes, and follower A is currently the only node proposing commits. Assume there are 3 transactions T1-T3 accumulated in the current main loop iteration to be sent to the leader. The Raft message sequence would look like the following:

1. Follower A: send CommandExec(T1) using prereq P1.
2. Leader: handle T1, send AppendReq(T1) to all followers.
3. Follower A: receive AppendReq(T1), update the local log_end (but not commit_index or last_applied), reply to the leader.
4. Leader: receive replies from a majority, reply success to follower A for CommandExec(T1), with the new commit_index.
5. Follower A: handle the reply for CommandExec(T1). However, it does nothing but some cleanup; the commit_index in the reply seems unused. (This is correct, since Raft doesn't suggest any action at this point.)
6. ... 1/3 heartbeat timer reached ...
7. Leader: send AppendReq(heartbeat) with the new commit_index to follower A.
8. Follower A: handle AppendReq(heartbeat) and update the local commit_index. Now follower A can apply T1 to its local committed data, and T2 can be proposed to the leader with the new prereq P2.
9. Follower A: send CommandExec(T2) using prereq P2.
...
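A back-of-the-envelope model makes the cost of this sequence concrete (numbers taken from the test setup above; all per-step costs other than the heartbeat wait at step 6 are ignored):

```python
# Election base time as raised in the test above; heartbeats are sent at
# roughly 1/3 of the expiry time.
ELECTION_BASE_MS = 10000
HEARTBEAT_MS = ELECTION_BASE_MS // 3

def batch_latency_ms(n_txns, heartbeat_ms=HEARTBEAT_MS):
    """Rough lower bound on committing a batch under the prereq scheme:
    each transaction must wait for the next heartbeat (step 6) before the
    follower learns the new commit_index and can send the next proposal
    with a valid prereq."""
    return n_txns * heartbeat_ms

latency = batch_latency_ms(3)
print(latency)  # 9999 -- about 10 seconds for just three transactions
```

So with the enlarged election timer, a batch of only three transactions from one follower takes on the order of 10 seconds to drain, which matches the very slow end-to-end test runs.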
As we can see, just to complete T1 there is a latency caused by the heartbeat timer at step 6, and T2 cannot start until step 8 (it could start earlier, as currently implemented, but that only results in prereq check failures and useless retries wasting more CPU cycles). So the batch of transactions on follower A has to be executed one by one, each waiting for up to one heartbeat interval (1/3 of the election timer). This causes big latencies and very low performance overall. An optimization could be done at step 5, when the follower receives the CommandExec reply, to move the local commit_index forward without waiting for the next heartbeat. However, I am not sure whether that introduces any problem, since it is not something mentioned in the Raft paper.

Because of the above problems, I believe the alternative approach would be better than the prereq check mechanism and would solve these problems easily. Instead of sending the final result to the leader, followers can simply forward the "transact" RPC from the client to the leader, and let the leader perform ovsdb_execute_compose() and ovsdb_txn_precommit() on top of its own latest data. This way, both problems are solved:

For problem 1), since the leader does the real transaction parsing and execution, the prerequisite is not needed; in fact there is no defined order between the transactions forwarded by followers, since they originally came from different clients, so the leader is free to apply them in any order it considers consistent. For problem 2), a batch of requests can be sent to the leader without delay. Only the final commit_index update for the last transaction in the batch will experience the delay caused by the heartbeat timer (similar to step 6 above), which seems to be an inherent property of the Raft design.

There are side effects to this alternative approach.
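The forwarding idea can be sketched as follows. This is a toy model with invented names (Node, execute_and_append, handle_client_transact), not the actual OVSDB implementation; the point is that a follower relays the client's raw transaction unchanged, carrying no prereq, so nothing blocks on a heartbeat between requests:

```python
class Node:
    """Toy cluster node for the forward-to-leader approach."""

    def __init__(self, is_leader, leader=None):
        self.is_leader = is_leader
        self.leader = leader
        self.log = []

    def execute_and_append(self, txn_json):
        # Leader side: parse and execute the transaction on top of its own
        # latest data, then append the result; the leader alone picks the
        # order, so no prerequisite check is needed.
        self.log.append(txn_json)
        return {"status": "committed", "index": len(self.log)}

    def handle_client_transact(self, txn_json):
        if self.is_leader:
            return self.execute_and_append(txn_json)
        # Follower side: relay the client's raw "transact" request verbatim.
        return self.leader.handle_client_transact(txn_json)

leader = Node(is_leader=True)
follower = Node(is_leader=False, leader=leader)

# Two clients hit the follower in the same iteration; both requests are
# forwarded immediately -- no prereq, no heartbeat wait in between.
r1 = follower.handle_client_transact({"op": "insert", "row": "lport1"})
r2 = follower.handle_client_transact({"op": "insert", "row": "lport2"})
print(r1["index"], r2["index"])  # 1 2
```

Both requests reach the leader back to back, which is exactly the behavior that eliminates the per-transaction heartbeat wait of problem 2.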
The leader will do a little more work than before, but compared with the CPU wasted on handling the useless failure retries, this is nothing. Even if the previously mentioned optimizations were implemented for the prereq mechanism to avoid the CPU waste, the alternative approach would still beat the prereq mechanism because of its overall performance advantage (there is no delay in sending requests to the leader).

Ben, please correct me if I misunderstood anything, and suggest the next steps.

Thanks,
Han

_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
