I have been debugging OVSDB Raft scale-testing problems recently, and here is a major finding. I initially thought there were several different problems, but it turned out that the one major problem is the prerequisite checking mechanism. My conclusion is that we should abandon prerequisite checking and switch to the alternative approach Ben mentioned at an earlier OVN meeting: the follower forwards the operations requested by clients directly to the leader, without any prerequisite assumption. Here is the detailed analysis.
The test environment is ovn-scale-test, with 3-node clusters for both the NB and SB DBs. There are 1k HVs connecting to the SB DB, with connections randomly distributed across the 3 cluster nodes. The test creates 10k lports and binds them on the 1k HVs. The bottleneck is in the SB DB only, because of the large number of clients and their continuous write requests (port-binding updates).

The first problem I encountered was that leader re-election happened frequently during the test, because the leader was not able to respond to heartbeat messages in time due to CPU load. The ideal solution would be a dedicated thread for Raft message handling, but to work around the problem and see what happens, I increased the election base time from 1024 ms to 10000 ms. Heartbeat messages are sent at 1/3 of the election timeout, so they now go out roughly every 3+ seconds. With this change the leader re-election problem was solved. However, it revealed the real reason the CPU load was high: CPU stayed at 100% on the leader, and the rate of INFO logs about prerequisite check failures was extremely high.
For example:

2019-02-20T19:27:02.001Z|00382|raft|INFO|Dropped 299433 log messages in last 12 seconds (most recently, 0 seconds ago) due to excessive rate
2019-02-20T19:27:02.001Z|00383|raft|INFO|current entry eid b4b77aaf-4f69-449b-a689-c6c2a02f452f does not match prerequisite 9477ce8d-1b8f-46af-8ab4-0ea044253742 in execute_command_request
2019-02-20T19:27:14.001Z|00388|raft|INFO|Dropped 300646 log messages in last 12 seconds (most recently, 0 seconds ago) due to excessive rate
2019-02-20T19:27:14.001Z|00389|raft|INFO|current entry eid c062d334-c6b8-4e8d-9ae8-38ed650c120b does not match prerequisite a87df6e0-661d-404c-8d7f-ad6de210cc41 in execute_command_request
2019-02-20T19:27:26.001Z|00394|raft|INFO|Dropped 301084 log messages in last 12 seconds (most recently, 0 seconds ago) due to excessive rate
2019-02-20T19:27:26.001Z|00395|raft|INFO|current entry eid 0f74e572-777e-4d6e-859a-739fac0293e5 does not match prerequisite def7f61f-f865-423b-a735-5c3c78d982e8 in execute_command_request

This is ~30k failures per second, caused by prereq check failures and continuous retries from the followers. My first reaction was to stop followers from retrying continuously by checking whether the DB's prereq had changed, since it makes no sense to retry with the same prereq as the previous failed attempt. I made the change and tried again. It did solve the CPU problem, but revealed two new problems.

Problem 1): There were still hundreds of prerequisite check failures per second in the leader's logs, and the whole scale test was extremely slow, just as in the previous high-CPU-load situation (6 hours to complete the whole test). By checking the Raft messages in debug logs, I found that this is because there are so many clients requesting DB updates that each follower can accumulate a batch of transactions from different clients in one main loop iteration and then send them one by one to the leader, all with *the same prereq*.
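The retry guard described above can be sketched as follows. This is a minimal illustrative model, not actual OVSDB code: the class and method names (Follower, propose_to_leader, etc.) are all made up for the example. The idea is simply that a follower re-proposes a failed transaction only after its locally applied eid (the prereq) has changed.

```python
class Follower:
    """Toy model of a follower that skips retries with an unchanged prereq."""

    def __init__(self):
        self.applied_eid = "eid-0"       # eid of the last locally applied txn
        self.last_failed_prereq = None   # prereq of the last rejected proposal
        self.sent = []                   # proposals actually sent to the leader

    def propose_to_leader(self, txn):
        # If the prereq is unchanged since the last failure, the leader
        # would reject this proposal with the same "does not match
        # prerequisite" error, so skip the useless retry.
        if self.last_failed_prereq == self.applied_eid:
            return False
        self.sent.append((txn, self.applied_eid))
        return True

    def on_prereq_failure(self):
        # Remember which prereq just failed so we do not retry with it.
        self.last_failed_prereq = self.applied_eid

    def on_applied(self, new_eid):
        # A new entry was applied locally; retries are worthwhile again.
        self.applied_eid = new_eid
        self.last_failed_prereq = None
```

With this guard the follower stays quiet until a new heartbeat advances its applied eid, instead of hammering the leader with identical doomed proposals.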
In this case, even with no racing from other nodes, only the first request from a given follower can succeed; the later ones are certain to fail the prereq check, because the first transaction proposed by that follower causes the leader to update its expected prereq. I tried another patch to solve this, updating the follower's "prereq" to the "result" - the eid of the last proposed commit - so that the *batch* of transactions proposed by the same follower would either all succeed if there is no racing from other nodes, or all fail if there is. This patch worked, and the whole test completed in 2 hours, similar to the result with a standalone DB. However, I realized that this trick violates basic transactional correctness: a proposed commit must be based on a previously committed version, and if the preceding transaction in the *batch* is not yet committed, setting its eid as the prereq is wrong. Going further, the whole attempt to propose a batch of transactions seems wrong - each transaction should be evaluated on top of previously committed transactions (unless we are doing concurrency control such as MVCC, which is obviously not the case in the current OVSDB design). So, for correctness, the prereq has to be the last applied eid on the follower side, as implemented today. Given this, the only optimization we can make to the prereq checking mechanism is probably to propose commits to the leader only when:

a) the raft applied index equals the latest raft eid in the follower's log - otherwise new proposals from other nodes have already happened, and proposing a commit that is bound to fail the prereq check just wastes CPU; and

b) there is no pending proposal from the current follower node that has been sent to the leader with the same prereq - otherwise the following proposal would fail the prereq check, too.
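Conditions a) and b) amount to a simple predicate on the follower's local state. A minimal sketch, with hypothetical field names (FollowerState, pending_prereq, etc.) standing in for the corresponding pieces of OVSDB's raft state:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FollowerState:
    applied_index: int            # index of the last entry applied locally
    log_end_index: int            # index of the last entry in the local raft log
    applied_eid: str              # eid the next proposal would use as its prereq
    pending_prereq: Optional[str] # prereq of an in-flight proposal, or None

def should_propose(s: FollowerState) -> bool:
    # a) The local log must be fully applied; otherwise other nodes'
    #    proposals are already ahead of us and the prereq check is
    #    guaranteed to fail.
    if s.applied_index != s.log_end_index:
        return False
    # b) No proposal from this node may already be in flight with the
    #    same prereq; a second one would fail once the first commits.
    if s.pending_prereq == s.applied_eid:
        return False
    return True
```

The follower would evaluate this before each CommandExec, avoiding the wasted round trips, though, as argued below, it fixes only the CPU waste and not the latency.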
However, while this optimization can solve the CPU load problem, it would not solve the performance problem. The scale test would still be slow, due to problem 2) described below.

Problem 2): The prereq check mechanism introduces latency. Take the simplest scenario as an example: assume there is no racing from other nodes, follower A is currently the only node proposing commits, and 3 transactions T1-T3 have accumulated in the current main loop iteration to be sent to the leader. The Raft message sequence would be as follows:

1. Follower A: send CommandExec(T1) using prereq P1.
2. Leader: handle T1, send AppendReq(T1) to all followers.
3. Follower A: receive AppendReq(T1), update local log_end (but not commit_index or last_applied), reply to the leader.
4. Leader: receive replies from a majority, reply success to follower A for CommandExec(T1), with the new commit_index.
5. Follower A: handle the reply for CommandExec(T1). However, it does nothing but some cleanup; the commit_index in the reply seems unused. (This is correct, since Raft doesn't suggest any action at this point.)
6. ... 1/3 heartbeat timer elapses ...
7. Leader: send AppendReq(heartbeat) with the new commit_index to follower A.
8. Follower A: handle AppendReq(heartbeat) and update the local commit_index. Now follower A can apply T1 to its local committed data, and T2 can be proposed to the leader with the new prereq P2.
9. Follower A: send CommandExec(T2) using prereq P2.
...

As we can see, just completing T1 incurs the latency of the heartbeat timer at step 6, and T2 cannot start until step 8 (it can start earlier, as currently implemented, but that only results in prereq check failures and useless retries wasting more CPU cycles). So the batch of transactions on follower A has to be executed one by one, each waiting up to a heartbeat interval (1/3 of the election timer). This causes large latencies and very low overall performance.
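The per-batch cost of this serialization is easy to estimate. The model below is a rough sketch under the assumptions stated in the scenario above (no racing, network time ignored, every transaction waiting a full heartbeat interval in the worst case); the function name is illustrative:

```python
def serialized_batch_latency_ms(n_txns: int, election_timer_ms: float) -> float:
    """Rough worst-case time for one follower to push n_txns through the
    prereq mechanism: each transaction must wait for the next heartbeat
    (sent at 1/3 of the election timer) before the follower learns the new
    commit_index and can propose the next transaction."""
    heartbeat_ms = election_timer_ms / 3
    return n_txns * heartbeat_ms
```

With the 10000 ms election base used in the test, a batch of just 3 transactions can take on the order of 10 seconds, which matches the observed slowness.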
An optimization might be done at step 5, when the follower receives the CommandExec reply: advance the local commit_index without waiting for the next heartbeat. However, I am not sure whether that introduces any problem, since it is not something mentioned in the Raft paper.

Because of the above problems, I believe the alternative approach would be better than the prereq check mechanism and would solve the problems easily. Instead of sending the final result to the leader, followers can simply forward the "transaction" RPC from the client to the leader, and let the leader perform ovsdb_execute_compose() and ovsdb_txn_precommit() on top of its own latest data. This way, both problems are solved. For problem 1), since the leader does the real transaction parsing and execution, the prerequisite is not needed - in fact there is no defined order among the transactions forwarded by followers, since they originally came from different clients, so the leader is free to apply them in any order it considers consistent. For problem 2), a batch of requests can be sent to the leader without delay; only the final commit_index update for the last transaction in the batch experiences the heartbeat-timer delay (similar to step 6 above), which seems to be an inherent property of the Raft design.

There are side effects to this alternative approach. The leader will do a little more work than before, but compared with the CPU wasted on handling the useless failure retries, this is nothing. Even if the previously mentioned optimizations were implemented for the prereq mechanism to avoid the CPU waste, the alternative approach would still beat it because of the overall performance advantage (there is no delay in sending requests to the leader).

Ben, please correct me if I misunderstood anything, and suggest next steps.
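The forwarding approach can be sketched at a very high level as follows. Everything here is a toy stand-in: the classes, the dict-based "database", and handle_forwarded_txn are invented for illustration, with handle_forwarded_txn playing the role of the leader running ovsdb_execute_compose() and ovsdb_txn_precommit() against its own latest data. The point is only that followers forward the raw client request and carry no prereq at all:

```python
class Leader:
    """Toy leader that executes forwarded transactions against its own
    latest data, so no prerequisite eid is involved."""

    def __init__(self):
        self.db = {}    # stand-in for the leader's committed data
        self.log = []   # stand-in for the raft log of executed txns

    def handle_forwarded_txn(self, txn_ops):
        # Execute on top of the latest leader state; in real OVSDB this
        # would be ovsdb_execute_compose() + ovsdb_txn_precommit().
        for op in txn_ops:
            self.db[op["column"]] = op["value"]
        self.log.append(txn_ops)
        return {"status": "success"}

class Follower:
    """Toy follower that forwards the client's request unchanged."""

    def __init__(self, leader):
        self.leader = leader

    def handle_client_transact(self, txn_ops):
        # No local execute, no prereq guess, no retry loop: just forward.
        return self.leader.handle_forwarded_txn(txn_ops)
```

Because the leader serializes the forwarded requests itself, transactions from different followers can be interleaved freely and none of them is rejected for a stale prereq.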
Thanks,
Han
_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
