On 6/28/24 11:43, Ilya Maximets wrote:
> On 6/28/24 09:20, Dumitru Ceara wrote:
>> On 6/27/24 00:02, Ilya Maximets wrote:
>>> Every transaction has RAFT log prerequisites. Even if transactions
>>> are not related (because RAFT doesn't actually know what data it is
>>> handling). When leader writes a new record to a RAFT storage, it is
>>> getting appended to the log right away and changes current 'eid',
>>> i.e., changes prerequisites. The leader will not try to write new
>>> records until the current one is committed, because until then the
>>> pre-check will be failing.
>>>
>>> However, that is different for the follower. Followers do not add
>>> records to the RAFT log until the leader sends an append request back.
>>> So, if there are multiple transactions pending on a follower, it will
>>> create a command for each of them and prerequisites will be set to the
>>> same values. All these commands will be sent to the leader, but only
>>> one can succeed at a time, because accepting one command immediately
>>> changes prerequisites and all other commands become non-applicable.
>>> So, out of N commands, 1 will succeed and N - 1 will fail. The cluster
>>> failure is a transient failure, so the follower will re-process all the
>>> failed transactions and send them again. 1 will succeed and N - 2 will
>>> fail. And so on, until there are no more transactions. In the end,
>>> instead of processing N transactions, the follower is performing
>>> N * (N - 1) / 2 transaction processing iterations. That is consuming
>>> a huge amount of CPU resources completely unnecessarily.
>>>
>>> Since there is no real chance for multiple transactions from the same
>>> follower to succeed, it's better to not send them in the first place.
>>> This also eliminates prerequisite mismatch messages on a leader in
>>> this particular case.
>>>
>>
>> Makes sense!
>>
>>> In a test with 30 parallel shell threads executing 12K transactions
>>> total with separate ovsdb-client calls through the same follower there
>>> is about 60% performance improvement. The test takes ~100 seconds to
>>> complete without this change and ~40 seconds with this change applied.
>>> The new time is very close to what it takes to execute the same test
>>> through the cluster leader.
>>>
>>
>> Maybe a public link to the test (if possible) would be a nice reference
>> to have in the future?
>
> I'm not sure where to post it, so I'll just leave it here, and I can
> add a link to this post into the commit message.
>
> The script is adding 12K ports across 120 logical switches in a style
> of ovn-kubernetes (sets up port security, some external IDs and adds
> to an address set):
>
> ---
> $ cat ../transaction-benchmark-parallel.sh
> #set -x
> set -o errexit
>
> # Creating 120 logical switches
> for i in $(seq 120); do
> ovn-nbctl ls-add ip-10-10-177-$i.us-west-2.compute.internal
> done
> IFS=$'\r\n' GLOBIGNORE='*' \
> command eval 'uuids=($(ovn-nbctl --bare --columns _uuid list
> logical_switch | sed "/^$/d"))'
>
> for i in $(seq 120); do
> echo "uuid #${i}: ${uuids[$(($i - 1))]}"
> done
>
> # Creating one addres set
> as_name="a0123456789012345678"
> as_uuid=$(ovn-nbctl create address_set name=${as_name})
>
> # Adding ports
> function add_400() {
> rm -rf txn_results$1.txt
> for i in $(seq $1 $(($1 + 399)) ); do
> echo $i
>
> port_name="node-density-00000000-0000-0000-0000-000000000000_node-density-$i"
> namespace="node-density-00000000-0000-0000-0000-000000000000"
> ls_name="ip-10-10-177-$((${i} % 120 + 1)).us-west-2.compute.internal"
> uuid_name="row00000000_0000_0000_0000_000000000000"
> mac=$(printf '0a:58:0a:9b:%02x:%02x' $((${i} >> 8)) $((${i} & 255)))
> ip=$(printf "10.11.%d.%d\n" $((${i} / 255)) $((${i} % 255 + 1)))
>
> time ovsdb-client transact unix:$(pwd)/sandbox/nb2.ovsdb
> "[\"OVN_Northbound\",
> {
> \"uuid-name\":\"${uuid_name}\",
> \"row\":{
> \"name\":\"${port_name}\",
> \"options\":[\"map\",[[\"requested-chassis\",\"${ls_name}\"]]],
> \"addresses\":[\"set\",[\"${mac} ${ip}\"]],
>
> \"external_ids\":[\"map\",[[\"pod\",\"true\"],[\"namespace\",\"${namespace}\"]]],
> \"port_security\":[\"set\",[\"${mac} ${ip}\"]]
> },
> \"op\":\"insert\", \"table\":\"Logical_Switch_Port\"
> },
> {
> \"where\":[[\"_uuid\",\"==\",[\"uuid\",\"${uuids[$(($i %
> 120))]}\"]]],
>
> \"mutations\":[[\"ports\",\"insert\",[\"set\",[[\"named-uuid\",\"${uuid_name}\"]]]]],
> \"op\":\"mutate\", \"table\":\"Logical_Switch\"
> },
> {
> \"where\":[[\"_uuid\",\"==\",[\"uuid\",\"${as_uuid}\"]]],
> \"mutations\":[[\"addresses\",\"insert\",[\"set\",[\"${ip}\"]]]],
> \"op\":\"mutate\", \"table\":\"Address_Set\"
> }]" >> txn_results$1.txt
> done
> echo "done" >> txn_results$1.txt
> }
>
> rm -f txn_results*.txt
>
> n=0
> for i in $(seq 0 400 11600); do
> n=$(($n + 1))
> add_400 $i &
> done
>
> while [ $(grep 'done' txn_results*.txt | wc -l) != $n ]; do
> sleep 1
> done
> ---
>
> It is not the most generic script and it is designed for sandbox,
> where nb1 is a leader by default:
>
> # make sandbox SANDBOXFLAGS="--nbdb-model=clustered"
> # pkill ovn-northd # to remove the noise
> # time bash ../transaction-benchmark-parallel.sh
>
> Note that the script doesn't perform a lot of correctness checks, so
> it's better to examine the txn_results*.txt files, or check the
> database content afterwards to be sure.
>
>>
>>> Note: prerequisite failures on a leader are still possible, but mostly
>>> in a case of simultaneous transactions from different followers. It's
>>> a normal thing for a distributed database due to its nature.
>>>
>>> Signed-off-by: Ilya Maximets <[email protected]>
>>> ---
>>> ovsdb/raft.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
>>> ovsdb/raft.h | 2 +-
>>> ovsdb/storage.c | 9 +++++----
>>> ovsdb/storage.h | 5 ++++-
>>> ovsdb/transaction.c | 6 +-----
>>> 5 files changed, 55 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
>>> index ac3d37ac4..116924058 100644
>>> --- a/ovsdb/raft.c
>>> +++ b/ovsdb/raft.c
>>> @@ -2307,12 +2307,55 @@ raft_get_eid(const struct raft *raft, uint64_t
>>> index)
>>> return &raft->snap.eid;
>>> }
>>>
>>> -const struct uuid *
>>> +static const struct uuid *
>>> raft_current_eid(const struct raft *raft)
>>> {
>>> return raft_get_eid(raft, raft->log_end - 1);
>>> }
>>>
>>> +bool
>>> +raft_precheck_prereq(const struct raft *raft, const struct uuid *prereq)
>>> +{
>>> + if (!uuid_equals(raft_current_eid(raft), prereq)) {
>>> + VLOG_DBG("%s: prerequisites (" UUID_FMT ") "
>>> + "do not match current eid (" UUID_FMT ")",
>>> + __func__, UUID_ARGS(prereq),
>>> + UUID_ARGS(raft_current_eid(raft)));
>>> + return false;
>>> + }
>>> +
>>> + /* Having incomplete commands on a follower means that the leader has
>>> + * these commands and they will change the prerequisites once added to
>>> + * the leader's log.
>>> + *
>>> + * There is a chance that all these commands will actually fail and the
>>> + * record with current prerequisites will in fact succeed, but, since
>>> + * these are our own commands, the chances are low.
>>> + *
>>> + * Incomplete commands on a leader will not change the leader's current
>>> + * 'eid' on commit as they are already part of the leader's log. */
>>> + if (raft->role != RAFT_LEADER && hmap_count(&raft->commands)) {
>>> + struct raft_command *cmd;
>>> +
>>> + HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) {
>>
>> Nit: I'd rewrite this as:
>>
>> if (raft->role == RAFT_LEADER) {
>> return true;
>> }
>>
>> HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) {
>> ...
>> }
>
> Makes sense, I can do that con commit, unless there will be further comments.
>
>>
>>> + /* Skip commands that are already part of the log (have
>>> non-zero
>>> + * index) and ones that do not carry any data (have zero
>>> 'eid'),
>>> + * as they can't change prerequisites.
>>> + *
>>> + * Database will not re-run triggers unless the data changes or
>>> + * one of the data-carrying triggers completes. So, pre-check
>>> must
>>> + * not fail if there are no outstanding data-carrying
>>> commands. */
>>> + if (!cmd->index && !uuid_is_zero(&cmd->eid)) {
>>> + VLOG_DBG("%s: follower still has an incomplete command "
>>> + UUID_FMT, __func__, UUID_ARGS(&cmd->eid));
>>> + return false;
>>> + }
>>> + }
>>> + }
>>> +
>>> + return true;
>>> +}
>>> +
>>> static struct raft_command *
>>> raft_command_create_completed(enum raft_command_status status)
>>> {
>>> diff --git a/ovsdb/raft.h b/ovsdb/raft.h
>>> index a5b55d9bf..5833aaf23 100644
>>> --- a/ovsdb/raft.h
>>> +++ b/ovsdb/raft.h
>>> @@ -189,5 +189,5 @@ struct ovsdb_error *raft_store_snapshot(struct raft *,
>>> void raft_take_leadership(struct raft *);
>>> void raft_transfer_leadership(struct raft *, const char *reason);
>>>
>>> -const struct uuid *raft_current_eid(const struct raft *);
>>> +bool raft_precheck_prereq(const struct raft *, const struct uuid *prereq);
>>> #endif /* lib/raft.h */
>>> diff --git a/ovsdb/storage.c b/ovsdb/storage.c
>>> index 6c395106c..c5aec5459 100644
>>> --- a/ovsdb/storage.c
>>> +++ b/ovsdb/storage.c
>>> @@ -661,11 +661,12 @@ ovsdb_storage_write_schema_change(struct
>>> ovsdb_storage *storage,
>>> return w;
>>> }
>>>
>>> -const struct uuid *
>>> -ovsdb_storage_peek_last_eid(struct ovsdb_storage *storage)
>>> +bool
>>> +ovsdb_storage_precheck_prereq(const struct ovsdb_storage *storage,
>>> + const struct uuid *prereq)
>>> {
>>> if (!storage->raft) {
>>> - return NULL;
>>> + return true;
>>> }
>>> - return raft_current_eid(storage->raft);
>>> + return raft_precheck_prereq(storage->raft, prereq);
>>> }
>>> diff --git a/ovsdb/storage.h b/ovsdb/storage.h
>>> index 05f40ce93..7079ea261 100644
>>> --- a/ovsdb/storage.h
>>> +++ b/ovsdb/storage.h
>>> @@ -96,6 +96,9 @@ struct ovsdb_storage *ovsdb_storage_open_standalone(const
>>> char *filename,
>>> bool rw);
>>> struct ovsdb_schema *ovsdb_storage_read_schema(struct ovsdb_storage *);
>>>
>>> -const struct uuid *ovsdb_storage_peek_last_eid(struct ovsdb_storage *);
>>> +/* Checks that there is a chance for a record with specified prerequisites
>>> + * to be successfully written to the storage. */
>>> +bool ovsdb_storage_precheck_prereq(const struct ovsdb_storage *,
>>> + const struct uuid *prereq);
>>>
>>> #endif /* ovsdb/storage.h */
>>> diff --git a/ovsdb/transaction.c b/ovsdb/transaction.c
>>> index 484a88e1c..65eca6478 100644
>>> --- a/ovsdb/transaction.c
>>> +++ b/ovsdb/transaction.c
>>> @@ -1277,11 +1277,7 @@ struct ovsdb_txn_progress {
>>> bool
>>> ovsdb_txn_precheck_prereq(const struct ovsdb *db)
>>> {
>>> - const struct uuid *eid = ovsdb_storage_peek_last_eid(db->storage);
>>> - if (!eid) {
>>> - return true;
>>> - }
>>> - return uuid_equals(&db->prereq, eid);
>>> + return ovsdb_storage_precheck_prereq(db->storage, &db->prereq);
>>> }
>>
>> Aside from the comment above, the patch looks good to me. If you
>> address my comment.. or if you decide to dismiss it too :) then:
>>
>> Acked-by: Dumitru Ceara <[email protected]>
I made the change above and applied the patch. Thanks!
Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev