On 6/28/24 09:20, Dumitru Ceara wrote:
> On 6/27/24 00:02, Ilya Maximets wrote:
>> Every transaction has RAFT log prerequisites. Even if transactions
>> are not related (because RAFT doesn't actually know what data it is
>> handling). When leader writes a new record to a RAFT storage, it is
>> getting appended to the log right away and changes current 'eid',
>> i.e., changes prerequisites. The leader will not try to write new
>> records until the current one is committed, because until then the
>> pre-check will be failing.
>>
>> However, that is different for the follower. Followers do not add
>> records to the RAFT log until the leader sends an append request back.
>> So, if there are multiple transactions pending on a follower, it will
>> create a command for each of them and prerequisites will be set to the
>> same values. All these commands will be sent to the leader, but only
>> one can succeed at a time, because accepting one command immediately
>> changes prerequisites and all other commands become non-applicable.
>> So, out of N commands, 1 will succeed and N - 1 will fail. The cluster
>> failure is a transient failure, so the follower will re-process all the
>> failed transactions and send them again. 1 will succeed and N - 2 will
>> fail. And so on, until there are no more transactions. In the end,
>> instead of processing N transactions, the follower is performing
>> N * (N - 1) / 2 transaction processing iterations. That is consuming
>> a huge amount of CPU resources completely unnecessarily.
>>
>> Since there is no real chance for multiple transactions from the same
>> follower to succeed, it's better to not send them in the first place.
>> This also eliminates prerequisite mismatch messages on a leader in
>> this particular case.
>>
>
> Makes sense!
>
>> In a test with 30 parallel shell threads executing 12K transactions
>> total with separate ovsdb-client calls through the same follower there
>> is about 60% performance improvement. The test takes ~100 seconds to
>> complete without this change and ~40 seconds with this change applied.
>> The new time is very close to what it takes to execute the same test
>> through the cluster leader.
>>
>
> Maybe a public link to the test (if possible) would be a nice reference
> to have in the future?
I'm not sure where to post it, so I'll just leave it here, and I can
add a link to this post into the commit message.
The script is adding 12K ports across 120 logical switches in a style
of ovn-kubernetes (sets up port security, some external IDs and adds
to an address set):
---
$ cat ../transaction-benchmark-parallel.sh
#set -x
set -o errexit
# Creating 120 logical switches
for i in $(seq 120); do
ovn-nbctl ls-add ip-10-10-177-$i.us-west-2.compute.internal
done
IFS=$'\r\n' GLOBIGNORE='*' \
command eval 'uuids=($(ovn-nbctl --bare --columns _uuid list
logical_switch | sed "/^$/d"))'
for i in $(seq 120); do
echo "uuid #${i}: ${uuids[$(($i - 1))]}"
done
# Creating one addres set
as_name="a0123456789012345678"
as_uuid=$(ovn-nbctl create address_set name=${as_name})
# Adding ports
function add_400() {
rm -rf txn_results$1.txt
for i in $(seq $1 $(($1 + 399)) ); do
echo $i
port_name="node-density-00000000-0000-0000-0000-000000000000_node-density-$i"
namespace="node-density-00000000-0000-0000-0000-000000000000"
ls_name="ip-10-10-177-$((${i} % 120 + 1)).us-west-2.compute.internal"
uuid_name="row00000000_0000_0000_0000_000000000000"
mac=$(printf '0a:58:0a:9b:%02x:%02x' $((${i} >> 8)) $((${i} & 255)))
ip=$(printf "10.11.%d.%d\n" $((${i} / 255)) $((${i} % 255 + 1)))
time ovsdb-client transact unix:$(pwd)/sandbox/nb2.ovsdb
"[\"OVN_Northbound\",
{
\"uuid-name\":\"${uuid_name}\",
\"row\":{
\"name\":\"${port_name}\",
\"options\":[\"map\",[[\"requested-chassis\",\"${ls_name}\"]]],
\"addresses\":[\"set\",[\"${mac} ${ip}\"]],
\"external_ids\":[\"map\",[[\"pod\",\"true\"],[\"namespace\",\"${namespace}\"]]],
\"port_security\":[\"set\",[\"${mac} ${ip}\"]]
},
\"op\":\"insert\", \"table\":\"Logical_Switch_Port\"
},
{
\"where\":[[\"_uuid\",\"==\",[\"uuid\",\"${uuids[$(($i % 120))]}\"]]],
\"mutations\":[[\"ports\",\"insert\",[\"set\",[[\"named-uuid\",\"${uuid_name}\"]]]]],
\"op\":\"mutate\", \"table\":\"Logical_Switch\"
},
{
\"where\":[[\"_uuid\",\"==\",[\"uuid\",\"${as_uuid}\"]]],
\"mutations\":[[\"addresses\",\"insert\",[\"set\",[\"${ip}\"]]]],
\"op\":\"mutate\", \"table\":\"Address_Set\"
}]" >> txn_results$1.txt
done
echo "done" >> txn_results$1.txt
}
rm -f txn_results*.txt
n=0
for i in $(seq 0 400 11600); do
n=$(($n + 1))
add_400 $i &
done
while [ $(grep 'done' txn_results*.txt | wc -l) != $n ]; do
sleep 1
done
---
It is not the most generic script and it is designed for sandbox,
where nb1 is a leader by default:
# make sandbox SANDBOXFLAGS="--nbdb-model=clustered"
# pkill ovn-northd # to remove the noise
# time bash ../transaction-benchmark-parallel.sh
Note that the script doesn't perform a lot of correctness checks, so
it's better to examine the txn_results*.txt files, or check the
database content afterwards to be sure.
>
>> Note: prerequisite failures on a leader are still possible, but mostly
>> in a case of simultaneous transactions from different followers. It's
>> a normal thing for a distributed database due to its nature.
>>
>> Signed-off-by: Ilya Maximets <[email protected]>
>> ---
>> ovsdb/raft.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
>> ovsdb/raft.h | 2 +-
>> ovsdb/storage.c | 9 +++++----
>> ovsdb/storage.h | 5 ++++-
>> ovsdb/transaction.c | 6 +-----
>> 5 files changed, 55 insertions(+), 12 deletions(-)
>>
>> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
>> index ac3d37ac4..116924058 100644
>> --- a/ovsdb/raft.c
>> +++ b/ovsdb/raft.c
>> @@ -2307,12 +2307,55 @@ raft_get_eid(const struct raft *raft, uint64_t index)
>> return &raft->snap.eid;
>> }
>>
>> -const struct uuid *
>> +static const struct uuid *
>> raft_current_eid(const struct raft *raft)
>> {
>> return raft_get_eid(raft, raft->log_end - 1);
>> }
>>
>> +bool
>> +raft_precheck_prereq(const struct raft *raft, const struct uuid *prereq)
>> +{
>> + if (!uuid_equals(raft_current_eid(raft), prereq)) {
>> + VLOG_DBG("%s: prerequisites (" UUID_FMT ") "
>> + "do not match current eid (" UUID_FMT ")",
>> + __func__, UUID_ARGS(prereq),
>> + UUID_ARGS(raft_current_eid(raft)));
>> + return false;
>> + }
>> +
>> + /* Having incomplete commands on a follower means that the leader has
>> + * these commands and they will change the prerequisites once added to
>> + * the leader's log.
>> + *
>> + * There is a chance that all these commands will actually fail and the
>> + * record with current prerequisites will in fact succeed, but, since
>> + * these are our own commands, the chances are low.
>> + *
>> + * Incomplete commands on a leader will not change the leader's current
>> + * 'eid' on commit as they are already part of the leader's log. */
>> + if (raft->role != RAFT_LEADER && hmap_count(&raft->commands)) {
>> + struct raft_command *cmd;
>> +
>> + HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) {
>
> Nit: I'd rewrite this as:
>
> if (raft->role == RAFT_LEADER) {
> return true;
> }
>
> HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) {
> ...
> }
Makes sense, I can do that con commit, unless there will be further comments.
>
>> + /* Skip commands that are already part of the log (have non-zero
>> + * index) and ones that do not carry any data (have zero 'eid'),
>> + * as they can't change prerequisites.
>> + *
>> + * Database will not re-run triggers unless the data changes or
>> + * one of the data-carrying triggers completes. So, pre-check
>> must
>> + * not fail if there are no outstanding data-carrying commands.
>> */
>> + if (!cmd->index && !uuid_is_zero(&cmd->eid)) {
>> + VLOG_DBG("%s: follower still has an incomplete command "
>> + UUID_FMT, __func__, UUID_ARGS(&cmd->eid));
>> + return false;
>> + }
>> + }
>> + }
>> +
>> + return true;
>> +}
>> +
>> static struct raft_command *
>> raft_command_create_completed(enum raft_command_status status)
>> {
>> diff --git a/ovsdb/raft.h b/ovsdb/raft.h
>> index a5b55d9bf..5833aaf23 100644
>> --- a/ovsdb/raft.h
>> +++ b/ovsdb/raft.h
>> @@ -189,5 +189,5 @@ struct ovsdb_error *raft_store_snapshot(struct raft *,
>> void raft_take_leadership(struct raft *);
>> void raft_transfer_leadership(struct raft *, const char *reason);
>>
>> -const struct uuid *raft_current_eid(const struct raft *);
>> +bool raft_precheck_prereq(const struct raft *, const struct uuid *prereq);
>> #endif /* lib/raft.h */
>> diff --git a/ovsdb/storage.c b/ovsdb/storage.c
>> index 6c395106c..c5aec5459 100644
>> --- a/ovsdb/storage.c
>> +++ b/ovsdb/storage.c
>> @@ -661,11 +661,12 @@ ovsdb_storage_write_schema_change(struct ovsdb_storage
>> *storage,
>> return w;
>> }
>>
>> -const struct uuid *
>> -ovsdb_storage_peek_last_eid(struct ovsdb_storage *storage)
>> +bool
>> +ovsdb_storage_precheck_prereq(const struct ovsdb_storage *storage,
>> + const struct uuid *prereq)
>> {
>> if (!storage->raft) {
>> - return NULL;
>> + return true;
>> }
>> - return raft_current_eid(storage->raft);
>> + return raft_precheck_prereq(storage->raft, prereq);
>> }
>> diff --git a/ovsdb/storage.h b/ovsdb/storage.h
>> index 05f40ce93..7079ea261 100644
>> --- a/ovsdb/storage.h
>> +++ b/ovsdb/storage.h
>> @@ -96,6 +96,9 @@ struct ovsdb_storage *ovsdb_storage_open_standalone(const
>> char *filename,
>> bool rw);
>> struct ovsdb_schema *ovsdb_storage_read_schema(struct ovsdb_storage *);
>>
>> -const struct uuid *ovsdb_storage_peek_last_eid(struct ovsdb_storage *);
>> +/* Checks that there is a chance for a record with specified prerequisites
>> + * to be successfully written to the storage. */
>> +bool ovsdb_storage_precheck_prereq(const struct ovsdb_storage *,
>> + const struct uuid *prereq);
>>
>> #endif /* ovsdb/storage.h */
>> diff --git a/ovsdb/transaction.c b/ovsdb/transaction.c
>> index 484a88e1c..65eca6478 100644
>> --- a/ovsdb/transaction.c
>> +++ b/ovsdb/transaction.c
>> @@ -1277,11 +1277,7 @@ struct ovsdb_txn_progress {
>> bool
>> ovsdb_txn_precheck_prereq(const struct ovsdb *db)
>> {
>> - const struct uuid *eid = ovsdb_storage_peek_last_eid(db->storage);
>> - if (!eid) {
>> - return true;
>> - }
>> - return uuid_equals(&db->prereq, eid);
>> + return ovsdb_storage_precheck_prereq(db->storage, &db->prereq);
>> }
>
> Aside from the comment above, the patch looks good to me. If you
> address my comment.. or if you decide to dismiss it too :) then:
>
> Acked-by: Dumitru Ceara <[email protected]>
>
> This part is not related to the patch itself, but reviewing the change I
> was wondering:
>
> ovsdb_txn_precheck_prereq() is called in ovsdb_trigger_try(), where we
> return early if the prereq failed for the transaction. But
> ovsdb_trigger_try() is called by ovsdb_trigger_run() for each pending
> trigger. If the first trigger failed precheck is there any reason to
> not break early from the loop there?
The ovsdb_trigger_try() handles also transactions that are already
in progress, so we have to run it for those.
There is also an unrelated issue that we block even read-only transactions
today if there is an incomplete change in-flight. This applies to both
the leader and the followers since introduction of the pre-check.
We could think of some optimization that would allow read-only transactions
in this case, the problem is that in the current design we need to process
the transaction in order to know if it is read-only or not and that might
be expensive. However, if we find a way to avoid this somehow, we can
still execute some transactions while pre-check for others fails. So, I'd
like to keep the loop as it is for now.
>
> bool
> ovsdb_trigger_run(struct ovsdb *db, long long int now)
> {
> struct ovsdb_trigger *t;
>
> bool run_triggers = db->run_triggers;
> db->run_triggers_now = db->run_triggers = false;
>
> bool disconnect_all = false;
>
> LIST_FOR_EACH_SAFE (t, node, &db->triggers) {
> cooperative_multitasking_yield();
>
> if (run_triggers
> || now - t->created >= t->timeout_msec
> || t->progress || t->txn_forward) {
> if (ovsdb_trigger_try(t, now)) {
> disconnect_all = true;
> }
> }
> }
> return disconnect_all;
> }
>
> Regards,
> Dumitru
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev