On 20 Sep 2024, at 14:40, Ilya Maximets wrote:
> ovn-kubernetes users observe uneven distribution of traffic among
> service backends. It gets noticeably worse when 9+ backends are
> configured. For example:
>
>   Server         Connections
>   server-6nbg7   634
>   server-72br8   326   <-----
>   server-7b8rm   605
>   server-9tnz7   655
>   server-c84hw   596
>   server-mjskl   622
>   server-ptkcc   336   <-----
>   server-rl7pk   592
>   server-xd2cb   634
>
> Here we can see that out of 5000 connections in total, servers '72br8'
> and 'ptkcc' received about half as many connections as any other
> server. With 10 backends, 4 servers will receive 2x less traffic, and
> so on. The situation is common, because Kubernetes users frequently
> scale their applications with something like 'kubectl scale --replicas=10'.
>
> Services are implemented as OVN load-balancers, which are in turn
> implemented as OpenFlow select groups with the dp_hash selection
> method. Balancing is implemented by allocating a hash space (the
> number of hashes is a power of 2) and distributing the hashes between
> the group buckets using Webster's method, taking bucket weights into
> account.
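>
> For illustration, below is a minimal standalone sketch of a Webster-style
> distribution for the equal-weight case. The fixed sizes, the tie-breaking
> and the priority 'weight / (2 * hits + 1)' are simplified stand-ins for
> the real logic in ofproto/ofproto-dpif.c:
>
>   /* Simplified illustration: assign N_HASH hash values to N_BUCKETS
>    * equally weighted buckets using Webster's (Sainte-Lague) divisors
>    * 1, 3, 5, ...  The real implementation also deals with OpenFlow
>    * bucket structures and the datapath hash table. */
>   #include <stdio.h>
>
>   #define N_BUCKETS 10
>   #define N_HASH 16
>
>   int
>   main(void)
>   {
>       int hits[N_BUCKETS] = { 0 };
>       double weight = 100.0;     /* All backends have the same weight. */
>
>       for (int h = 0; h < N_HASH; h++) {
>           int best = 0;
>           double best_prio = -1.0;
>
>           /* The next hash goes to the bucket with the highest
>            * weight / (2 * hits + 1). */
>           for (int i = 0; i < N_BUCKETS; i++) {
>               double prio = weight / (2 * hits[i] + 1);
>
>               if (prio > best_prio) {
>                   best_prio = prio;
>                   best = i;
>               }
>           }
>           hits[best]++;
>       }
>
>       for (int i = 0; i < N_BUCKETS; i++) {
>           printf("bucket %2d: %d hash(es)\n", i, hits[i]);
>       }
>       return 0;
>   }
>
> With 10 buckets and 16 hashes this sketch ends up with 2 hashes in 6 of
> the buckets and 1 hash in the other 4, which is exactly the kind of
> split described in the example below.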
>
> This works well enough when buckets have different weights, because
> OVS needs to allocate a larger hash space in order to accommodate the
> weight difference. However, in the case of OVN load-balancers, all the
> buckets (backends) have the same weight. This leads to allocation of a
> smaller hash space, just large enough to fit all the backends (rounded
> up to the next power of 2). For example, if we have 10 backends
> (buckets), then 16 hashes will be allocated. After hashes are
> distributed, we end up with 6 buckets having 2 hashes each and 4
> buckets having 1 hash each. So, each of the 6 buckets will receive on
> average 2x more traffic than each of the remaining 4.
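>
> The sizing step itself can be sketched as follows; DIV_ROUND_UP and
> round_up_pow2 below are simplified re-implementations of the OVS macros,
> mirroring the pre-patch logic shown in the diff at the end:
>
>   /* Pre-patch hash space sizing for 10 equally weighted backends. */
>   #include <inttypes.h>
>   #include <stdint.h>
>   #include <stdio.h>
>
>   #define DIV_ROUND_UP(X, Y) (((X) + (Y) - 1) / (Y))
>
>   static uint64_t
>   round_up_pow2(uint64_t x)
>   {
>       uint64_t p = 1;
>
>       while (p < x) {
>           p <<= 1;
>       }
>       return p;
>   }
>
>   int
>   main(void)
>   {
>       uint64_t n_buckets = 10;
>       uint64_t total_weight = 100 * n_buckets;  /* Equal weights of 100. */
>       uint64_t min_weight = 100;
>
>       /* min_slots == n_buckets when all weights are equal. */
>       uint64_t min_slots = DIV_ROUND_UP(total_weight, min_weight); /* 10 */
>       uint64_t n_hash = round_up_pow2(min_slots);                  /* 16 */
>
>       if (n_hash < 16) {
>           n_hash = 16;    /* Existing lower limit of 16 hashes. */
>       }
>
>       /* 16 hashes over 10 buckets: 6 buckets get 2 hashes, 4 get 1. */
>       printf("n_hash = %"PRIu64": %"PRIu64" buckets x 2 hashes, "
>              "%"PRIu64" buckets x 1 hash\n",
>              n_hash, n_hash - n_buckets, 2 * n_buckets - n_hash);
>       return 0;
>   }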
>
> The problem can be mitigated by expanding the hash space, and this is
> already done for cases with a small number of buckets by requiring the
> hash space to have at least 16 hashes. However, that is just not large
> enough for 9 or 10 buckets to provide a good distribution. At the same
> time, blindly increasing the limit is not a good solution either, as we
> do not want too large a hash space for a small number of buckets,
> simply because each hash is a separate datapath flow and having too
> many of them may cause performance issues.
>
> The approach taken in this change is to ensure that the hash space is
> at least 4 times the number of buckets, but not larger than the maximum
> allowed (256). This provides a better distribution while not
> unnecessarily exploding the number of datapath flows for services that
> do not have that many backends.
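>
> In code, the proposed sizing looks roughly like the sketch below; the
> macro names and round_up_pow2 are again simplified stand-ins, the real
> change is in the diff at the end:
>
>   /* Proposed sizing: at least 4 hashes per bucket, capped at 256, then
>    * rounded up to a power of 2 with the existing minimum of 16. */
>   #include <inttypes.h>
>   #include <stdint.h>
>   #include <stdio.h>
>
>   #define MAX_SELECT_GROUP_HASH_VALUES 256
>
>   static uint64_t
>   round_up_pow2(uint64_t x)
>   {
>       uint64_t p = 1;
>
>       while (p < x) {
>           p <<= 1;
>       }
>       return p;
>   }
>
>   int
>   main(void)
>   {
>       for (uint64_t n_buckets = 1; n_buckets <= 16; n_buckets++) {
>           uint64_t min_slots = n_buckets;    /* Equal weights. */
>           uint64_t want = n_buckets * 4;
>
>           if (want > MAX_SELECT_GROUP_HASH_VALUES) {
>               want = MAX_SELECT_GROUP_HASH_VALUES;
>           }
>
>           uint64_t min_slots2 = min_slots > want ? min_slots : want;
>           uint64_t n_hash = round_up_pow2(min_slots2);
>
>           if (n_hash < 16) {
>               n_hash = 16;
>           }
>           printf("%2"PRIu64" buckets -> %3"PRIu64" hashes\n",
>                  n_buckets, n_hash);
>       }
>       return 0;
>   }
>
> For 10 buckets this gives 64 hashes, so each bucket ends up with 6 or 7
> hashes instead of 1 or 2, which is a much smaller relative difference.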
>
> Here is some data to demonstrate why 4 was chosen as the coefficient:
>
> coeff.  :       1       2       3       4       5     100
> ---------------------------------------------------------
> AvgDiff :  43.1 %  27.1 %  18.3 %  15.1 %  13.6 %  10.9 %
> MaxDiff :  50.0 %  33.3 %  25.0 %  20.0 %  20.0 %  20.0 %
> AvgDev  :  24.9 %  13.4 %   8.5 %   6.9 %   6.1 %   4.8 %
> MaxDev  :  35.4 %  20.4 %  14.4 %  11.2 %  11.2 %  11.2 %
> --------+------------------------------------------------
>      16 :       1       1       1       1       1       -
>      32 :      17       9       6       5       4       -
>      64 :      33      17      11       9       7       -
>     128 :      65      33      22      17      13       1
>     256 :     129      65      43      33      26       2
> --------+------------------------------------------------
>           current                 proposed
>
> The table shows the average and maximum load difference (Diff) between
> backends across groups with 1 to 64 equally weighted backends, and the
> average and maximum standard deviation (Dev) of the load distribution
> for the same groups. For example, with a coefficient of 2, the maximum
> difference between two backends will be 33% and the maximum standard
> deviation will be 20.4%. With the current logic (coefficient of 1) we
> have a maximum difference as high as 50%, as shown in the example at
> the beginning, with a standard deviation of 35.4%.
>
> The bottom half of the table shows from how many backends we start to
> use a particular number of hashes. For example, with a coefficient of 3
> we will have 16 hashes for 1 to 5 buckets, 32 hashes for 6-10 buckets,
> 64 hashes for 11-21, and so on.
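>
> For reference, numbers of this kind can be approximated with a small
> simulation like the one below. It assumes that, with equal weights,
> Webster's method splits the hashes as evenly as possible between the
> buckets; the exact figures depend on how the data is aggregated, so
> treat the output as an approximation of the table rather than an exact
> reproduction:
>
>   /* Approximate the Diff/Dev metrics for a given coefficient, assuming
>    * an even split of hashes between equally weighted buckets. */
>   #include <inttypes.h>
>   #include <math.h>
>   #include <stdint.h>
>   #include <stdio.h>
>
>   #define MAX_SELECT_GROUP_HASH_VALUES 256
>
>   static uint64_t
>   round_up_pow2(uint64_t x)
>   {
>       uint64_t p = 1;
>
>       while (p < x) {
>           p <<= 1;
>       }
>       return p;
>   }
>
>   int
>   main(void)
>   {
>       const uint64_t coeff = 4;
>       double sum_diff = 0, sum_dev = 0, max_diff = 0, max_dev = 0;
>       int n_groups = 0;
>
>       for (uint64_t n = 1; n <= 64; n++) {
>           uint64_t want = n * coeff;
>
>           if (want > MAX_SELECT_GROUP_HASH_VALUES) {
>               want = MAX_SELECT_GROUP_HASH_VALUES;
>           }
>
>           /* min_slots == n for equal weights. */
>           uint64_t n_hash = round_up_pow2(want > n ? want : n);
>
>           if (n_hash < 16) {
>               n_hash = 16;
>           }
>
>           /* Even split: 'extra' buckets get 'lo + 1' hashes, the others
>            * get 'lo'. */
>           uint64_t lo = n_hash / n;
>           uint64_t extra = n_hash % n;
>           double mean = (double) n_hash / n;
>           double diff = extra ? 1.0 / (lo + 1) : 0.0;
>           double var = (extra * (lo + 1 - mean) * (lo + 1 - mean)
>                         + (n - extra) * (lo - mean) * (lo - mean)) / n;
>           double dev = sqrt(var) / mean;
>
>           sum_diff += diff;
>           sum_dev += dev;
>           max_diff = diff > max_diff ? diff : max_diff;
>           max_dev = dev > max_dev ? dev : max_dev;
>           n_groups++;
>       }
>
>       printf("coeff %"PRIu64": AvgDiff %.1f%%  MaxDiff %.1f%%  "
>              "AvgDev %.1f%%  MaxDev %.1f%%\n",
>              coeff, 100 * sum_diff / n_groups, 100 * max_diff,
>              100 * sum_dev / n_groups, 100 * max_dev);
>       return 0;
>   }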
>
> According to the table, 4 is about where we achieve a good enough
> standard deviation for the load (11.2%) while still not creating too
> many hashes for cases with a low number of backends. The standard
> deviation also doesn't go down that much with higher coefficients.
>
> The data is aggregated for up to 64 backends because with higher
> numbers we're getting close to the cap of 256 hashes, so the deviation
> increases. However, the load difference with that many backends should
> have a lower impact on the cluster in general. The change still improves
> the cases with 64+ backends, but to get very good numbers there we'll
> need to consider raising the upper limit of the hash space, which is a
> separate change to think about.
>
> The added test checks that the standard deviation of the load
> distribution between buckets is below 12.5% for all cases up to 64
> buckets. It's just a pretty number I've chosen that should IMO be
> good enough.
>
> Note: when the number of buckets is a power of 2, we get an ideal
> distribution with zero deviation, because hashes can be evenly mapped
> to buckets.
>
> Note 2: The change doesn't affect the distribution for 4 or fewer
> backends, since 4 x 4 = 16 and those groups still get the existing
> minimum of 16 hashes.
>
> Signed-off-by: Ilya Maximets <[email protected]>

Thanks for the patch and the extensive log message description.

The change looks good to me. I only did some basic testing, but I do not
see how this would break any real traffic.

Cheers,
Eelco

Acked-by: Eelco Chaudron <[email protected]>

> ---
>
> Not sure if this should be considered a bug fix and be backported.
> Thoughts on this are welcome.
>
> ofproto/ofproto-dpif.c | 6 +++--
> tests/ofproto-dpif.at | 54 ++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 58 insertions(+), 2 deletions(-)
>
> diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
> index 4659d8a3b..bf43d5d4b 100644
> --- a/ofproto/ofproto-dpif.c
> +++ b/ofproto/ofproto-dpif.c
> @@ -5341,8 +5341,10 @@ group_setup_dp_hash_table(struct group_dpif *group, size_t max_hash)
>               min_weight, total_weight);
>
>      uint64_t min_slots = DIV_ROUND_UP(total_weight, min_weight);
> -    uint64_t min_slots2 = ROUND_UP_POW2(min_slots);
> -    uint64_t n_hash = MAX(16, min_slots2);
> +    uint64_t min_slots2 =
> +        MAX(min_slots, MIN(n_buckets * 4, MAX_SELECT_GROUP_HASH_VALUES));
> +    uint64_t min_slots3 = ROUND_UP_POW2(min_slots2);
> +    uint64_t n_hash = MAX(16, min_slots3);
>      if (n_hash > MAX_SELECT_GROUP_HASH_VALUES ||
>          (max_hash != 0 && n_hash > max_hash)) {
>          VLOG_DBG("  Too many hash values required: %"PRIu64, n_hash);
> diff --git a/tests/ofproto-dpif.at b/tests/ofproto-dpif.at
> index 12cb7f7a6..1c99dc3bb 100644
> --- a/tests/ofproto-dpif.at
> +++ b/tests/ofproto-dpif.at
> @@ -1238,6 +1238,60 @@ bucket3 >= 500
> OVS_VSWITCHD_STOP
> AT_CLEANUP
>
> +AT_SETUP([ofproto-dpif - select group with dp_hash and equal weights])
> +
> +OVS_VSWITCHD_START
> +add_of_ports br0 1 10
> +
> +AT_CHECK([ovs-appctl vlog/set ofproto_dpif:file:dbg vconn:file:info])
> +
> +AT_DATA([stddev.awk], [
> + {
> + # $1 (target) is a mean value, because all weights are the same.
> + # $2 (hits) is an actual number of hashes assigned to this bucket.
> + n_hashes += $2
> + n_buckets++
> + sum_sq_diff += ($2 - $1) * ($2 - $1)
> + }
> + END {
> + mean = n_hashes / n_buckets
> + stddev = sqrt(sum_sq_diff / n_buckets)
> + stddevp = stddev * 100 / mean
> +
> + print "hashes:", n_hashes, "buckets:", n_buckets
> + print "mean:", mean, "stddev:", stddev, "(", stddevp, "% )"
> +
> + # Make sure that standard deviation of load between buckets is below 12.5%.
> + # Note: it's not a strict requirement, but a good number that passes tests.
> + if (stddevp <= 12.5) { print "PASS" }
> + else { print "FAIL" }
> + }
> +])
> +
> +m4_define([CHECK_DISTRIBUTION], [
> + AT_CHECK([tail -n $1 ovs-vswitchd.log | grep 'ofproto_dpif|DBG|.*Bucket' \
> + | sed 's/.*target=\([[0-9\.]]*\) hits=\([[0-9]]*\)/\1 \2/' \
> + | awk -f stddev.awk], [0], [stdout])
> + AT_CHECK([grep -q "buckets: $2" stdout])
> + AT_CHECK([grep -q 'PASS' stdout])
> +])
> +
> +m4_define([OF_GROUP], [group_id=$1,type=select,selection_method=dp_hash])
> +m4_define([OFG_BUCKET], [bucket=weight=$1,output:10])
> +
> +dnl Test load distribution in groups with up to 64 equally weighted buckets.
> +m4_define([OFG_BUCKETS], [OFG_BUCKET(100)])
> +m4_for([id], [1], [64], [1], [
> + get_log_next_line_num
> + AT_CHECK([ovs-ofctl -O OpenFlow15 add-group br0 \
> + "OF_GROUP(id),OFG_BUCKETS()"])
> + CHECK_DISTRIBUTION([+$LINENUM], [id])
> + m4_append([OFG_BUCKETS], [,OFG_BUCKET(100)])
> +])
> +
> +OVS_VSWITCHD_STOP
> +AT_CLEANUP
> +
> AT_SETUP([ofproto-dpif - select group with explicit dp_hash selection method])
>
> OVS_VSWITCHD_START
> --
> 2.46.0
>