On 6/27/24 00:02, Ilya Maximets wrote: > Every transaction has RAFT log prerequisites. Even if transactions > are not related (because RAFT doesn't actually know what data it is > handling). When leader writes a new record to a RAFT storage, it is > getting appended to the log right away and changes current 'eid', > i.e., changes prerequisites. The leader will not try to write new > records until the current one is committed, because until then the > pre-check will be failing. > > However, that is different for the follower. Followers do not add > records to the RAFT log until the leader sends an append request back. > So, if there are multiple transactions pending on a follower, it will > create a command for each of them and prerequisites will be set to the > same values. All these commands will be sent to the leader, but only > one can succeed at a time, because accepting one command immediately > changes prerequisites and all other commands become non-applicable. > So, out of N commands, 1 will succeed and N - 1 will fail. The cluster > failure is a transient failure, so the follower will re-process all the > failed transactions and send them again. 1 will succeed and N - 2 will > fail. And so on, until there are no more transactions. In the end, > instead of processing N transactions, the follower is performing > N * (N - 1) / 2 transaction processing iterations. That is consuming > a huge amount of CPU resources completely unnecessarily. > > Since there is no real chance for multiple transactions from the same > follower to succeed, it's better to not send them in the first place. > This also eliminates prerequisite mismatch messages on a leader in > this particular case. >
Makes sense! > In a test with 30 parallel shell threads executing 12K transactions > total with separate ovsdb-client calls through the same follower there > is about 60% performance improvement. The test takes ~100 seconds to > complete without this change and ~40 seconds with this change applied. > The new time is very close to what it takes to execute the same test > through the cluster leader. > Maybe a public link to the test (if possible) would be a nice reference to have in the future? > Note: prerequisite failures on a leader are still possible, but mostly > in a case of simultaneous transactions from different followers. It's > a normal thing for a distributed database due to its nature. > > Signed-off-by: Ilya Maximets <[email protected]> > --- > ovsdb/raft.c | 45 ++++++++++++++++++++++++++++++++++++++++++++- > ovsdb/raft.h | 2 +- > ovsdb/storage.c | 9 +++++---- > ovsdb/storage.h | 5 ++++- > ovsdb/transaction.c | 6 +----- > 5 files changed, 55 insertions(+), 12 deletions(-) > > diff --git a/ovsdb/raft.c b/ovsdb/raft.c > index ac3d37ac4..116924058 100644 > --- a/ovsdb/raft.c > +++ b/ovsdb/raft.c > @@ -2307,12 +2307,55 @@ raft_get_eid(const struct raft *raft, uint64_t index) > return &raft->snap.eid; > } > > -const struct uuid * > +static const struct uuid * > raft_current_eid(const struct raft *raft) > { > return raft_get_eid(raft, raft->log_end - 1); > } > > +bool > +raft_precheck_prereq(const struct raft *raft, const struct uuid *prereq) > +{ > + if (!uuid_equals(raft_current_eid(raft), prereq)) { > + VLOG_DBG("%s: prerequisites (" UUID_FMT ") " > + "do not match current eid (" UUID_FMT ")", > + __func__, UUID_ARGS(prereq), > + UUID_ARGS(raft_current_eid(raft))); > + return false; > + } > + > + /* Having incomplete commands on a follower means that the leader has > + * these commands and they will change the prerequisites once added to > + * the leader's log. > + * > + * There is a chance that all these commands will actually fail and the > + * record with current prerequisites will in fact succeed, but, since > + * these are our own commands, the chances are low. > + * > + * Incomplete commands on a leader will not change the leader's current > + * 'eid' on commit as they are already part of the leader's log. */ > + if (raft->role != RAFT_LEADER && hmap_count(&raft->commands)) { > + struct raft_command *cmd; > + > + HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) { Nit: I'd rewrite this as: if (raft->role == RAFT_LEADER) { return true; } HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) { ... } > + /* Skip commands that are already part of the log (have non-zero > + * index) and ones that do not carry any data (have zero 'eid'), > + * as they can't change prerequisites. > + * > + * Database will not re-run triggers unless the data changes or > + * one of the data-carrying triggers completes. So, pre-check > must > + * not fail if there are no outstanding data-carrying commands. > */ > + if (!cmd->index && !uuid_is_zero(&cmd->eid)) { > + VLOG_DBG("%s: follower still has an incomplete command " > + UUID_FMT, __func__, UUID_ARGS(&cmd->eid)); > + return false; > + } > + } > + } > + > + return true; > +} > + > static struct raft_command * > raft_command_create_completed(enum raft_command_status status) > { > diff --git a/ovsdb/raft.h b/ovsdb/raft.h > index a5b55d9bf..5833aaf23 100644 > --- a/ovsdb/raft.h > +++ b/ovsdb/raft.h > @@ -189,5 +189,5 @@ struct ovsdb_error *raft_store_snapshot(struct raft *, > void raft_take_leadership(struct raft *); > void raft_transfer_leadership(struct raft *, const char *reason); > > -const struct uuid *raft_current_eid(const struct raft *); > +bool raft_precheck_prereq(const struct raft *, const struct uuid *prereq); > #endif /* lib/raft.h */ > diff --git a/ovsdb/storage.c b/ovsdb/storage.c > index 6c395106c..c5aec5459 100644 > --- a/ovsdb/storage.c > +++ b/ovsdb/storage.c > @@ -661,11 +661,12 @@ ovsdb_storage_write_schema_change(struct ovsdb_storage > *storage, > return w; > } > > -const struct uuid * > -ovsdb_storage_peek_last_eid(struct ovsdb_storage *storage) > +bool > +ovsdb_storage_precheck_prereq(const struct ovsdb_storage *storage, > + const struct uuid *prereq) > { > if (!storage->raft) { > - return NULL; > + return true; > } > - return raft_current_eid(storage->raft); > + return raft_precheck_prereq(storage->raft, prereq); > } > diff --git a/ovsdb/storage.h b/ovsdb/storage.h > index 05f40ce93..7079ea261 100644 > --- a/ovsdb/storage.h > +++ b/ovsdb/storage.h > @@ -96,6 +96,9 @@ struct ovsdb_storage *ovsdb_storage_open_standalone(const > char *filename, > bool rw); > struct ovsdb_schema *ovsdb_storage_read_schema(struct ovsdb_storage *); > > -const struct uuid *ovsdb_storage_peek_last_eid(struct ovsdb_storage *); > +/* Checks that there is a chance for a record with specified prerequisites > + * to be successfully written to the storage. */ > +bool ovsdb_storage_precheck_prereq(const struct ovsdb_storage *, > + const struct uuid *prereq); > > #endif /* ovsdb/storage.h */ > diff --git a/ovsdb/transaction.c b/ovsdb/transaction.c > index 484a88e1c..65eca6478 100644 > --- a/ovsdb/transaction.c > +++ b/ovsdb/transaction.c > @@ -1277,11 +1277,7 @@ struct ovsdb_txn_progress { > bool > ovsdb_txn_precheck_prereq(const struct ovsdb *db) > { > - const struct uuid *eid = ovsdb_storage_peek_last_eid(db->storage); > - if (!eid) { > - return true; > - } > - return uuid_equals(&db->prereq, eid); > + return ovsdb_storage_precheck_prereq(db->storage, &db->prereq); > } Aside from the comment above, the patch looks good to me. If you address my comment.. or if you decide to dismiss it too :) then: Acked-by: Dumitru Ceara <[email protected]> This part is not related to the patch itself, but reviewing the change I was wondering: ovsdb_txn_precheck_prereq() is called in ovsdb_trigger_try(), where we return early if the prereq failed for the transaction. But ovsdb_trigger_try() is called by ovsdb_trigger_run() for each pending trigger. If the first trigger failed precheck is there any reason to not break early from the loop there? bool ovsdb_trigger_run(struct ovsdb *db, long long int now) { struct ovsdb_trigger *t; bool run_triggers = db->run_triggers; db->run_triggers_now = db->run_triggers = false; bool disconnect_all = false; LIST_FOR_EACH_SAFE (t, node, &db->triggers) { cooperative_multitasking_yield(); if (run_triggers || now - t->created >= t->timeout_msec || t->progress || t->txn_forward) { if (ovsdb_trigger_try(t, now)) { disconnect_all = true; } } } return disconnect_all; } Regards, Dumitru _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
