Re: [ovs-dev] [PATCH] ovsdb: raft: Don't forward more than one command to the leader.

Ilya Maximets Fri, 28 Jun 2024 02:44:17 -0700

On 6/28/24 09:20, Dumitru Ceara wrote:
> On 6/27/24 00:02, Ilya Maximets wrote:
>> Every transaction has RAFT log prerequisites.  Even if transactions
>> are not related (because RAFT doesn't actually know what data it is
>> handling).  When leader writes a new record to a RAFT storage, it is
>> getting appended to the log right away and changes current 'eid',
>> i.e., changes prerequisites.  The leader will not try to write new
>> records until the current one is committed, because until then the
>> pre-check will be failing.
>>
>> However, that is different for the follower.  Followers do not add
>> records to the RAFT log until the leader sends an append request back.
>> So, if there are multiple transactions pending on a follower, it will
>> create a command for each of them and prerequisites will be set to the
>> same values.  All these commands will be sent to the leader, but only
>> one can succeed at a time, because accepting one command immediately
>> changes prerequisites and all other commands become non-applicable.
>> So, out of N commands, 1 will succeed and N - 1 will fail.  The cluster
>> failure is a transient failure, so the follower will re-process all the
>> failed transactions and send them again.  1 will succeed and N - 2 will
>> fail.  And so on, until there are no more transactions.  In the end,
>> instead of processing N transactions, the follower is performing
>> N * (N - 1) / 2 transaction processing iterations.  That is consuming
>> a huge amount of CPU resources completely unnecessarily.
>>
>> Since there is no real chance for multiple transactions from the same
>> follower to succeed, it's better to not send them in the first place.
>> This also eliminates prerequisite mismatch messages on a leader in
>> this particular case.
>>
> 
> Makes sense!
> 
>> In a test with 30 parallel shell threads executing 12K transactions
>> total with separate ovsdb-client calls through the same follower there
>> is about 60% performance improvement.  The test takes ~100 seconds to
>> complete without this change and ~40 seconds with this change applied.
>> The new time is very close to what it takes to execute the same test
>> through the cluster leader.
>>
> 
> Maybe a public link to the test (if possible) would be a nice reference
> to have in the future?


I'm not sure where to post it, so I'll just leave it here, and I can
add a link to this post into the commit message.

The script is adding 12K ports across 120 logical switches in a style
of ovn-kubernetes (sets up port security, some external IDs and adds
to an address set):

---
$ cat ../transaction-benchmark-parallel.sh
#set -x
set -o errexit

# Creating 120 logical switches
for i in $(seq 120); do
    ovn-nbctl ls-add ip-10-10-177-$i.us-west-2.compute.internal
done
IFS=$'\r\n' GLOBIGNORE='*' \
    command eval  'uuids=($(ovn-nbctl --bare --columns _uuid list 
logical_switch | sed "/^$/d"))'

for i in $(seq 120); do
    echo "uuid #${i}: ${uuids[$(($i - 1))]}"
done

# Creating one addres set
as_name="a0123456789012345678"
as_uuid=$(ovn-nbctl create address_set name=${as_name})

# Adding ports
function add_400() {
    rm -rf txn_results$1.txt
    for i in $(seq $1 $(($1 + 399)) ); do
        echo $i
        
port_name="node-density-00000000-0000-0000-0000-000000000000_node-density-$i"
        namespace="node-density-00000000-0000-0000-0000-000000000000"
        ls_name="ip-10-10-177-$((${i} % 120 + 1)).us-west-2.compute.internal"
        uuid_name="row00000000_0000_0000_0000_000000000000"
        mac=$(printf '0a:58:0a:9b:%02x:%02x' $((${i} >> 8)) $((${i} & 255)))
        ip=$(printf "10.11.%d.%d\n" $((${i} / 255)) $((${i} % 255 + 1)))

        time ovsdb-client transact unix:$(pwd)/sandbox/nb2.ovsdb 
"[\"OVN_Northbound\",
        {
          \"uuid-name\":\"${uuid_name}\",
          \"row\":{
              \"name\":\"${port_name}\",
              \"options\":[\"map\",[[\"requested-chassis\",\"${ls_name}\"]]],
              \"addresses\":[\"set\",[\"${mac} ${ip}\"]],
              
\"external_ids\":[\"map\",[[\"pod\",\"true\"],[\"namespace\",\"${namespace}\"]]],
              \"port_security\":[\"set\",[\"${mac} ${ip}\"]]
            },
          \"op\":\"insert\", \"table\":\"Logical_Switch_Port\"
        },
        {
          \"where\":[[\"_uuid\",\"==\",[\"uuid\",\"${uuids[$(($i % 120))]}\"]]],
          
\"mutations\":[[\"ports\",\"insert\",[\"set\",[[\"named-uuid\",\"${uuid_name}\"]]]]],
          \"op\":\"mutate\", \"table\":\"Logical_Switch\"
        },
        {
          \"where\":[[\"_uuid\",\"==\",[\"uuid\",\"${as_uuid}\"]]],
          \"mutations\":[[\"addresses\",\"insert\",[\"set\",[\"${ip}\"]]]],
          \"op\":\"mutate\", \"table\":\"Address_Set\"
        }]" >> txn_results$1.txt
    done
    echo "done" >> txn_results$1.txt
}

rm -f txn_results*.txt

n=0
for i in $(seq 0 400 11600); do
    n=$(($n + 1))
    add_400 $i &
done

while [ $(grep 'done' txn_results*.txt | wc -l) != $n ]; do
    sleep 1
done
---

It is not the most generic script and it is designed for sandbox,
where nb1 is a leader by default:

# make sandbox SANDBOXFLAGS="--nbdb-model=clustered"
# pkill ovn-northd # to remove the noise
# time bash ../transaction-benchmark-parallel.sh

Note that the script doesn't perform a lot of correctness checks, so
it's better to examine the txn_results*.txt files, or check the
database content afterwards to be sure.

> 
>> Note: prerequisite failures on a leader are still possible, but mostly
>> in a case of simultaneous transactions from different followers.  It's
>> a normal thing for a distributed database due to its nature.
>>
>> Signed-off-by: Ilya Maximets <[email protected]>
>> ---
>>  ovsdb/raft.c        | 45 ++++++++++++++++++++++++++++++++++++++++++++-
>>  ovsdb/raft.h        |  2 +-
>>  ovsdb/storage.c     |  9 +++++----
>>  ovsdb/storage.h     |  5 ++++-
>>  ovsdb/transaction.c |  6 +-----
>>  5 files changed, 55 insertions(+), 12 deletions(-)
>>
>> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
>> index ac3d37ac4..116924058 100644
>> --- a/ovsdb/raft.c
>> +++ b/ovsdb/raft.c
>> @@ -2307,12 +2307,55 @@ raft_get_eid(const struct raft *raft, uint64_t index)
>>      return &raft->snap.eid;
>>  }
>>  
>> -const struct uuid *
>> +static const struct uuid *
>>  raft_current_eid(const struct raft *raft)
>>  {
>>      return raft_get_eid(raft, raft->log_end - 1);
>>  }
>>  
>> +bool
>> +raft_precheck_prereq(const struct raft *raft, const struct uuid *prereq)
>> +{
>> +    if (!uuid_equals(raft_current_eid(raft), prereq)) {
>> +        VLOG_DBG("%s: prerequisites (" UUID_FMT ") "
>> +                 "do not match current eid (" UUID_FMT ")",
>> +                 __func__, UUID_ARGS(prereq),
>> +                 UUID_ARGS(raft_current_eid(raft)));
>> +        return false;
>> +    }
>> +
>> +    /* Having incomplete commands on a follower means that the leader has
>> +     * these commands and they will change the prerequisites once added to
>> +     * the leader's log.
>> +     *
>> +     * There is a chance that all these commands will actually fail and the
>> +     * record with current prerequisites will in fact succeed, but, since
>> +     * these are our own commands, the chances are low.
>> +     *
>> +     * Incomplete commands on a leader will not change the leader's current
>> +     * 'eid' on commit as they are already part of the leader's log. */
>> +    if (raft->role != RAFT_LEADER && hmap_count(&raft->commands)) {
>> +        struct raft_command *cmd;
>> +
>> +        HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) {
> 
> Nit: I'd rewrite this as:
> 
> if (raft->role == RAFT_LEADER) {
>     return true;
> }
> 
> HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) {
>     ...
> }

Makes sense, I can do that con commit, unless there will be further comments.

> 
>> +            /* Skip commands that are already part of the log (have non-zero
>> +             * index) and ones that do not carry any data (have zero 'eid'),
>> +             * as they can't change prerequisites.
>> +             *
>> +             * Database will not re-run triggers unless the data changes or
>> +             * one of the data-carrying triggers completes.  So, pre-check 
>> must
>> +             * not fail if there are no outstanding data-carrying commands. 
>> */
>> +            if (!cmd->index && !uuid_is_zero(&cmd->eid)) {
>> +                VLOG_DBG("%s: follower still has an incomplete command "
>> +                         UUID_FMT, __func__, UUID_ARGS(&cmd->eid));
>> +                return false;
>> +            }
>> +        }
>> +    }
>> +
>> +    return true;
>> +}
>> +
>>  static struct raft_command *
>>  raft_command_create_completed(enum raft_command_status status)
>>  {
>> diff --git a/ovsdb/raft.h b/ovsdb/raft.h
>> index a5b55d9bf..5833aaf23 100644
>> --- a/ovsdb/raft.h
>> +++ b/ovsdb/raft.h
>> @@ -189,5 +189,5 @@ struct ovsdb_error *raft_store_snapshot(struct raft *,
>>  void raft_take_leadership(struct raft *);
>>  void raft_transfer_leadership(struct raft *, const char *reason);
>>  
>> -const struct uuid *raft_current_eid(const struct raft *);
>> +bool raft_precheck_prereq(const struct raft *, const struct uuid *prereq);
>>  #endif /* lib/raft.h */
>> diff --git a/ovsdb/storage.c b/ovsdb/storage.c
>> index 6c395106c..c5aec5459 100644
>> --- a/ovsdb/storage.c
>> +++ b/ovsdb/storage.c
>> @@ -661,11 +661,12 @@ ovsdb_storage_write_schema_change(struct ovsdb_storage 
>> *storage,
>>      return w;
>>  }
>>  
>> -const struct uuid *
>> -ovsdb_storage_peek_last_eid(struct ovsdb_storage *storage)
>> +bool
>> +ovsdb_storage_precheck_prereq(const struct ovsdb_storage *storage,
>> +                              const struct uuid *prereq)
>>  {
>>      if (!storage->raft) {
>> -        return NULL;
>> +        return true;
>>      }
>> -    return raft_current_eid(storage->raft);
>> +    return raft_precheck_prereq(storage->raft, prereq);
>>  }
>> diff --git a/ovsdb/storage.h b/ovsdb/storage.h
>> index 05f40ce93..7079ea261 100644
>> --- a/ovsdb/storage.h
>> +++ b/ovsdb/storage.h
>> @@ -96,6 +96,9 @@ struct ovsdb_storage *ovsdb_storage_open_standalone(const 
>> char *filename,
>>                                                      bool rw);
>>  struct ovsdb_schema *ovsdb_storage_read_schema(struct ovsdb_storage *);
>>  
>> -const struct uuid *ovsdb_storage_peek_last_eid(struct ovsdb_storage *);
>> +/* Checks that there is a chance for a record with specified prerequisites
>> + * to be successfully written to the storage. */
>> +bool ovsdb_storage_precheck_prereq(const struct ovsdb_storage *,
>> +                                   const struct uuid *prereq);
>>  
>>  #endif /* ovsdb/storage.h */
>> diff --git a/ovsdb/transaction.c b/ovsdb/transaction.c
>> index 484a88e1c..65eca6478 100644
>> --- a/ovsdb/transaction.c
>> +++ b/ovsdb/transaction.c
>> @@ -1277,11 +1277,7 @@ struct ovsdb_txn_progress {
>>  bool
>>  ovsdb_txn_precheck_prereq(const struct ovsdb *db)
>>  {
>> -    const struct uuid *eid = ovsdb_storage_peek_last_eid(db->storage);
>> -    if (!eid) {
>> -        return true;
>> -    }
>> -    return uuid_equals(&db->prereq, eid);
>> +    return ovsdb_storage_precheck_prereq(db->storage, &db->prereq);
>>  }
> 
> Aside from the comment above, the patch looks good to me.  If you
> address my comment.. or if you decide to dismiss it too :) then:
> 
> Acked-by: Dumitru Ceara <[email protected]>
> 
> This part is not related to the patch itself, but reviewing the change I
> was wondering:
> 
> ovsdb_txn_precheck_prereq() is called in ovsdb_trigger_try(), where we
> return early if the prereq failed for the transaction.  But
> ovsdb_trigger_try() is called by ovsdb_trigger_run() for each pending
> trigger.  If the first trigger failed precheck is there any reason to
> not break early from the loop there?

The ovsdb_trigger_try() handles also transactions that are already
in progress, so we have to run it for those.

There is also an unrelated issue that we block even read-only transactions
today if there is an incomplete change in-flight.  This applies to both
the leader and the followers since introduction of the pre-check.
We could think of some optimization that would allow read-only transactions
in this case, the problem is that in the current design we need to process
the transaction in order to know if it is read-only or not and that might
be expensive.  However, if we find a way to avoid this somehow, we can
still execute some transactions while pre-check for others fails.  So, I'd
like to keep the loop as it is for now.

> 
> bool
> ovsdb_trigger_run(struct ovsdb *db, long long int now)
> {
>     struct ovsdb_trigger *t;
> 
>     bool run_triggers = db->run_triggers;
>     db->run_triggers_now = db->run_triggers = false;
> 
>     bool disconnect_all = false;
> 
>     LIST_FOR_EACH_SAFE (t, node, &db->triggers) {
>         cooperative_multitasking_yield();
> 
>         if (run_triggers
>             || now - t->created >= t->timeout_msec
>             || t->progress || t->txn_forward) {
>             if (ovsdb_trigger_try(t, now)) {
>                 disconnect_all = true;
>             }
>         }
>     }
>     return disconnect_all;
> }
> 
> Regards,
> Dumitru
> 

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH] ovsdb: raft: Don't forward more than one command to the leader.

Reply via email to