On 6/27/24 00:02, Ilya Maximets wrote:
> Every transaction has RAFT log prerequisites.  Even if transactions
> are not related (because RAFT doesn't actually know what data it is
> handling).  When leader writes a new record to a RAFT storage, it is
> getting appended to the log right away and changes current 'eid',
> i.e., changes prerequisites.  The leader will not try to write new
> records until the current one is committed, because until then the
> pre-check will be failing.
> 
> However, that is different for the follower.  Followers do not add
> records to the RAFT log until the leader sends an append request back.
> So, if there are multiple transactions pending on a follower, it will
> create a command for each of them and prerequisites will be set to the
> same values.  All these commands will be sent to the leader, but only
> one can succeed at a time, because accepting one command immediately
> changes prerequisites and all other commands become non-applicable.
> So, out of N commands, 1 will succeed and N - 1 will fail.  The cluster
> failure is a transient failure, so the follower will re-process all the
> failed transactions and send them again.  1 will succeed and N - 2 will
> fail.  And so on, until there are no more transactions.  In the end,
> instead of processing N transactions, the follower is performing
> N * (N - 1) / 2 transaction processing iterations.  That is consuming
> a huge amount of CPU resources completely unnecessarily.
> 
> Since there is no real chance for multiple transactions from the same
> follower to succeed, it's better to not send them in the first place.
> This also eliminates prerequisite mismatch messages on a leader in
> this particular case.
> 

Makes sense!

> In a test with 30 parallel shell threads executing 12K transactions
> total with separate ovsdb-client calls through the same follower there
> is about 60% performance improvement.  The test takes ~100 seconds to
> complete without this change and ~40 seconds with this change applied.
> The new time is very close to what it takes to execute the same test
> through the cluster leader.
> 

Maybe a public link to the test (if possible) would be a nice reference
to have in the future?

> Note: prerequisite failures on a leader are still possible, but mostly
> in a case of simultaneous transactions from different followers.  It's
> a normal thing for a distributed database due to its nature.
> 
> Signed-off-by: Ilya Maximets <[email protected]>
> ---
>  ovsdb/raft.c        | 45 ++++++++++++++++++++++++++++++++++++++++++++-
>  ovsdb/raft.h        |  2 +-
>  ovsdb/storage.c     |  9 +++++----
>  ovsdb/storage.h     |  5 ++++-
>  ovsdb/transaction.c |  6 +-----
>  5 files changed, 55 insertions(+), 12 deletions(-)
> 
> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> index ac3d37ac4..116924058 100644
> --- a/ovsdb/raft.c
> +++ b/ovsdb/raft.c
> @@ -2307,12 +2307,55 @@ raft_get_eid(const struct raft *raft, uint64_t index)
>      return &raft->snap.eid;
>  }
>  
> -const struct uuid *
> +static const struct uuid *
>  raft_current_eid(const struct raft *raft)
>  {
>      return raft_get_eid(raft, raft->log_end - 1);
>  }
>  
> +bool
> +raft_precheck_prereq(const struct raft *raft, const struct uuid *prereq)
> +{
> +    if (!uuid_equals(raft_current_eid(raft), prereq)) {
> +        VLOG_DBG("%s: prerequisites (" UUID_FMT ") "
> +                 "do not match current eid (" UUID_FMT ")",
> +                 __func__, UUID_ARGS(prereq),
> +                 UUID_ARGS(raft_current_eid(raft)));
> +        return false;
> +    }
> +
> +    /* Having incomplete commands on a follower means that the leader has
> +     * these commands and they will change the prerequisites once added to
> +     * the leader's log.
> +     *
> +     * There is a chance that all these commands will actually fail and the
> +     * record with current prerequisites will in fact succeed, but, since
> +     * these are our own commands, the chances are low.
> +     *
> +     * Incomplete commands on a leader will not change the leader's current
> +     * 'eid' on commit as they are already part of the leader's log. */
> +    if (raft->role != RAFT_LEADER && hmap_count(&raft->commands)) {
> +        struct raft_command *cmd;
> +
> +        HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) {

Nit: I'd rewrite this as:

if (raft->role == RAFT_LEADER) {
    return true;
}

HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) {
    ...
}

> +            /* Skip commands that are already part of the log (have non-zero
> +             * index) and ones that do not carry any data (have zero 'eid'),
> +             * as they can't change prerequisites.
> +             *
> +             * Database will not re-run triggers unless the data changes or
> +             * one of the data-carrying triggers completes.  So, pre-check 
> must
> +             * not fail if there are no outstanding data-carrying commands. 
> */
> +            if (!cmd->index && !uuid_is_zero(&cmd->eid)) {
> +                VLOG_DBG("%s: follower still has an incomplete command "
> +                         UUID_FMT, __func__, UUID_ARGS(&cmd->eid));
> +                return false;
> +            }
> +        }
> +    }
> +
> +    return true;
> +}
> +
>  static struct raft_command *
>  raft_command_create_completed(enum raft_command_status status)
>  {
> diff --git a/ovsdb/raft.h b/ovsdb/raft.h
> index a5b55d9bf..5833aaf23 100644
> --- a/ovsdb/raft.h
> +++ b/ovsdb/raft.h
> @@ -189,5 +189,5 @@ struct ovsdb_error *raft_store_snapshot(struct raft *,
>  void raft_take_leadership(struct raft *);
>  void raft_transfer_leadership(struct raft *, const char *reason);
>  
> -const struct uuid *raft_current_eid(const struct raft *);
> +bool raft_precheck_prereq(const struct raft *, const struct uuid *prereq);
>  #endif /* lib/raft.h */
> diff --git a/ovsdb/storage.c b/ovsdb/storage.c
> index 6c395106c..c5aec5459 100644
> --- a/ovsdb/storage.c
> +++ b/ovsdb/storage.c
> @@ -661,11 +661,12 @@ ovsdb_storage_write_schema_change(struct ovsdb_storage 
> *storage,
>      return w;
>  }
>  
> -const struct uuid *
> -ovsdb_storage_peek_last_eid(struct ovsdb_storage *storage)
> +bool
> +ovsdb_storage_precheck_prereq(const struct ovsdb_storage *storage,
> +                              const struct uuid *prereq)
>  {
>      if (!storage->raft) {
> -        return NULL;
> +        return true;
>      }
> -    return raft_current_eid(storage->raft);
> +    return raft_precheck_prereq(storage->raft, prereq);
>  }
> diff --git a/ovsdb/storage.h b/ovsdb/storage.h
> index 05f40ce93..7079ea261 100644
> --- a/ovsdb/storage.h
> +++ b/ovsdb/storage.h
> @@ -96,6 +96,9 @@ struct ovsdb_storage *ovsdb_storage_open_standalone(const 
> char *filename,
>                                                      bool rw);
>  struct ovsdb_schema *ovsdb_storage_read_schema(struct ovsdb_storage *);
>  
> -const struct uuid *ovsdb_storage_peek_last_eid(struct ovsdb_storage *);
> +/* Checks that there is a chance for a record with specified prerequisites
> + * to be successfully written to the storage. */
> +bool ovsdb_storage_precheck_prereq(const struct ovsdb_storage *,
> +                                   const struct uuid *prereq);
>  
>  #endif /* ovsdb/storage.h */
> diff --git a/ovsdb/transaction.c b/ovsdb/transaction.c
> index 484a88e1c..65eca6478 100644
> --- a/ovsdb/transaction.c
> +++ b/ovsdb/transaction.c
> @@ -1277,11 +1277,7 @@ struct ovsdb_txn_progress {
>  bool
>  ovsdb_txn_precheck_prereq(const struct ovsdb *db)
>  {
> -    const struct uuid *eid = ovsdb_storage_peek_last_eid(db->storage);
> -    if (!eid) {
> -        return true;
> -    }
> -    return uuid_equals(&db->prereq, eid);
> +    return ovsdb_storage_precheck_prereq(db->storage, &db->prereq);
>  }

Aside from the comment above, the patch looks good to me.  If you
address my comment.. or if you decide to dismiss it too :) then:

Acked-by: Dumitru Ceara <[email protected]>

This part is not related to the patch itself, but reviewing the change I
was wondering:

ovsdb_txn_precheck_prereq() is called in ovsdb_trigger_try(), where we
return early if the prereq failed for the transaction.  But
ovsdb_trigger_try() is called by ovsdb_trigger_run() for each pending
trigger.  If the first trigger failed precheck is there any reason to
not break early from the loop there?

bool
ovsdb_trigger_run(struct ovsdb *db, long long int now)
{
    struct ovsdb_trigger *t;

    bool run_triggers = db->run_triggers;
    db->run_triggers_now = db->run_triggers = false;

    bool disconnect_all = false;

    LIST_FOR_EACH_SAFE (t, node, &db->triggers) {
        cooperative_multitasking_yield();

        if (run_triggers
            || now - t->created >= t->timeout_msec
            || t->progress || t->txn_forward) {
            if (ovsdb_trigger_try(t, now)) {
                disconnect_all = true;
            }
        }
    }
    return disconnect_all;
}

Regards,
Dumitru

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to