Ilya Maximets <[email protected]> writes:

> ovn-kubernetes users observe uneven distribution of traffic among
> service backends.  It gets noticeably worse when 9+ backends are
> configured.  For example:
>
>     Server        connections
>     server-6nbg7     634
>     server-72br8     326  <-----
>     server-7b8rm     605
>     server-9tnz7     655
>     server-c84hw     596
>     server-mjskl     622
>     server-ptkcc     336  <-----
>     server-rl7pk     592
>     server-xd2cb     634
>
> Here we can see that out of 5000 connections total, servers '72br8'
> and 'ptkcc' received roughly half as many connections as any other
> server.  With 10 backends, 4 servers will receive about half the
> traffic, and so on.  The situation is common, because Kubernetes
> users frequently scale their applications with something like
> kubectl scale --replicas=10.
>
> Services are implemented as OVN load balancers, which are realized
> as OpenFlow select groups with the dp_hash selection method.  Balancing
> works by allocating a hash space (the number of hashes is a power
> of 2) and distributing the hashes among the group buckets using
> Webster's method, taking bucket weights into account.
>
> This works well enough when buckets have different weights, because
> OVS has to allocate a larger hash space in order to accommodate the
> weight difference.  However, in the case of OVN load balancers, all
> the buckets (backends) have the same weight.  This leads to allocation
> of a hash space just large enough to fit all the backends (the next
> power of 2).  For example, if we have 10 backends (buckets), then
> 16 hashes will be allocated.  After the hashes are distributed, we
> end up with 6 buckets having 2 hashes each and 4 buckets having 1
> hash each.  So, each of the 6 buckets will receive on average twice
> as much traffic as each of the remaining 4.
>
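For anyone following along, here is a toy, standalone sketch of how
Webster's (Sainte-Lague) apportionment ends up splitting 16 hashes
across 10 equal-weight buckets.  It is an illustration only, not the
actual OVS code in group_setup_dp_hash_table(), but it reproduces the
6x2 + 4x1 split described above:

    /* Toy Webster's apportionment: hand out N_HASH hash values to
     * N_BUCKETS buckets one at a time, always picking the bucket with
     * the highest weight / (2 * allocated + 1). */
    #include <stdio.h>

    #define N_BUCKETS 10
    #define N_HASH    16

    int
    main(void)
    {
        int weight[N_BUCKETS];
        int hashes[N_BUCKETS] = { 0 };
        int i, h;

        for (i = 0; i < N_BUCKETS; i++) {
            weight[i] = 100;    /* All backends equal, as in OVN LBs. */
        }

        for (h = 0; h < N_HASH; h++) {
            int best = 0;
            double best_prio = -1.0;

            for (i = 0; i < N_BUCKETS; i++) {
                double prio = (double) weight[i] / (2 * hashes[i] + 1);

                if (prio > best_prio) {
                    best_prio = prio;
                    best = i;
                }
            }
            hashes[best]++;
        }

        for (i = 0; i < N_BUCKETS; i++) {
            printf("bucket %d: %d hashes\n", i, hashes[i]);
        }
        return 0;
    }

With equal weights the first 10 hashes go one per bucket, and the
remaining 6 all land on buckets 0-5, so 6 buckets end up with 2 hashes
each and 4 with only 1.
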
> The problem can be mitigated by expanding the hash space, and this is
> already done for cases with a small number of buckets by requiring the
> hash space to have at least 16 hashes.  However, that is just not large
> enough for 9 or 10 buckets to provide a good distribution.  At the
> same time, blindly increasing the minimum is not a good solution either,
> as we do not want too large a hash space for a small number of
> buckets, simply because each hash is a separate datapath flow and
> having too many of them may cause performance issues.
>
> The approach taken in this change is to ensure that the hash space is
> at least 4 times the number of buckets, but not larger than the
> maximum allowed (256).  This provides a better distribution without
> unnecessarily inflating the number of datapath flows for services
> with only a few backends.
>
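To make the sizing rule concrete, here is a simplified standalone
approximation of it (my own sketch, not the patch itself; the real
computation is in group_setup_dp_hash_table() in the diff below).  It
assumes equal bucket weights, so min_slots == n_buckets, and MAX_HASH
stands in for MAX_SELECT_GROUP_HASH_VALUES:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define MAX_HASH 256    /* MAX_SELECT_GROUP_HASH_VALUES. */

    /* Round 'x' up to the next power of 2 ('x' >= 1). */
    static uint64_t
    round_up_pow2(uint64_t x)
    {
        uint64_t p = 1;

        while (p < x) {
            p <<= 1;
        }
        return p;
    }

    /* Proposed hash space size for 'n_buckets' equally weighted
     * buckets: at least 4x the bucket count, capped at MAX_HASH,
     * rounded up to a power of 2, and never below 16.  The real code
     * additionally bails out if the result exceeds the limit. */
    static uint64_t
    n_hash_for(uint64_t n_buckets)
    {
        uint64_t wanted = n_buckets * 4 < MAX_HASH
                          ? n_buckets * 4 : MAX_HASH;
        uint64_t n_hash = round_up_pow2(wanted);

        return n_hash < 16 ? 16 : n_hash;
    }

    int
    main(void)
    {
        uint64_t n;

        for (n = 1; n <= 64; n++) {
            printf("%2"PRIu64" buckets -> %3"PRIu64" hashes\n",
                   n, n_hash_for(n));
        }
        return 0;
    }

Printing this for 1 to 64 buckets reproduces the 'proposed' column
boundaries in the table below: 32 hashes starting at 5 buckets, 64 at
9, 128 at 17 and 256 at 33.
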
> Here is some data to demonstrate why 4 was chosen as the coefficient:
>
>   coeff.  :   1         2         3         4         5        100
>   -------------------------------------------------------------------
>   AvgDiff : 43.1 %    27.1 %    18.3 %    15.1 %    13.6 %    10.9 %
>   MaxDiff : 50.0 %    33.3 %    25.0 %    20.0 %    20.0 %    20.0 %
>   AvgDev  : 24.9 %    13.4 %     8.5 %     6.9 %     6.1 %     4.8 %
>   MaxDev  : 35.4 %    20.4 %    14.4 %    11.2 %    11.2 %    11.2 %
>   --------+----------------------------------------------------------
>     16    :   1         1         1         1         1         -
>     32    :  17         9         6         5         4         -
>     64    :  33        17        11         9         7         -
>    128    :  65        33        22        17        13         1
>    256    : 129        65        43        33        26         2
>   --------+----------------------------------------------------------
>            current                       proposed
>
> The top half of the table shows the average and maximum load difference
> (Diff) between backends across groups with 1 to 64 equally weighted
> backends, as well as the average and maximum standard deviation (Dev)
> of the load distribution for the same groups.  For example, with a
> coefficient of 2, the maximum difference between two backends will be
> 33% and the maximum standard deviation will be 20.4%.  With the current
> logic (coefficient of 1) we have a maximum difference as high as 50%,
> as shown by the example at the beginning, with a standard deviation
> of 35.4%.
>
> The bottom half of the table shows from how many backends we start to
> use a particular number of hashes.  For example, with a coefficient
> of 3 we will have 16 hashes for 1 to 5 buckets, 32 hashes for 6-10,
> 64 hashes for 11-21, and so on.
>
> According to the table, a coefficient of 4 is about where we achieve a
> good enough standard deviation for the load (at most 11.2%) while still
> not creating too many hashes for cases with a low number of backends.
> The standard deviation also doesn't go down much with higher
> coefficients.
>
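As a concrete data point (my own arithmetic, not from the patch): with
10 backends and a coefficient of 4 we get 64 hashes, so 4 buckets
receive 7 hashes and 6 buckets receive 6.  The mean is 6.4 and the
standard deviation is sqrt((4*(7-6.4)^2 + 6*(6-6.4)^2)/10) ~= 0.49,
i.e. about 7.7% of the mean -- much better than the 2x skew in the
16-hash case above.
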
> The data is aggregated for up to 64 backends because with higher
> numbers we're getting close to the cap of 256 hashes, so the deviation
> increases.  However, the load difference with so many backends should
> have a lower impact on a cluster in general.  The change still improves
> the cases with 64+ backends, but to get very good numbers there we'll
> need to consider raising the upper limit for the hash space, which is
> a separate change to think about.
>
> The added test checks that the standard deviation of the load
> distribution between buckets is below 12.5% for all cases up to 64
> buckets.  It's just a round number I've chosen that should IMO be
> good enough.
>
> Note: when the number of buckets is a power of 2, we have an ideal
> distribution with zero deviation, because the hashes can be mapped
> evenly to the buckets.
>
> Note 2: The change doesn't affect the distribution for 4 or fewer
> backends, since those groups still get the minimum of 16 hashes.
>
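(If I followed the math correctly, that's because 4 buckets * 4 = 16,
which is already the existing minimum; 5 backends is the first case
that changes, going from 16 to 32 hashes.)
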
> Signed-off-by: Ilya Maximets <[email protected]>
> ---
>
> Not sure if this should be considered a bug fix and backported.
> Thoughts on this are welcome.

I think it should be okay to backport.  It isn't a new feature.

Acked-by: Aaron Conole <[email protected]>

>  ofproto/ofproto-dpif.c |  6 +++--
>  tests/ofproto-dpif.at  | 54 ++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 58 insertions(+), 2 deletions(-)
>
> diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
> index 4659d8a3b..bf43d5d4b 100644
> --- a/ofproto/ofproto-dpif.c
> +++ b/ofproto/ofproto-dpif.c
> @@ -5341,8 +5341,10 @@ group_setup_dp_hash_table(struct group_dpif *group, size_t max_hash)
>               min_weight, total_weight);
>  
>      uint64_t min_slots = DIV_ROUND_UP(total_weight, min_weight);
> -    uint64_t min_slots2 = ROUND_UP_POW2(min_slots);
> -    uint64_t n_hash = MAX(16, min_slots2);
> +    uint64_t min_slots2 =
> +        MAX(min_slots, MIN(n_buckets * 4, MAX_SELECT_GROUP_HASH_VALUES));
> +    uint64_t min_slots3 = ROUND_UP_POW2(min_slots2);
> +    uint64_t n_hash = MAX(16, min_slots3);
>      if (n_hash > MAX_SELECT_GROUP_HASH_VALUES ||
>          (max_hash != 0 && n_hash > max_hash)) {
>          VLOG_DBG("  Too many hash values required: %"PRIu64, n_hash);
> diff --git a/tests/ofproto-dpif.at b/tests/ofproto-dpif.at
> index 12cb7f7a6..1c99dc3bb 100644
> --- a/tests/ofproto-dpif.at
> +++ b/tests/ofproto-dpif.at
> @@ -1238,6 +1238,60 @@ bucket3 >= 500
>  OVS_VSWITCHD_STOP
>  AT_CLEANUP
>  
> +AT_SETUP([ofproto-dpif - select group with dp_hash and equal weights])
> +
> +OVS_VSWITCHD_START
> +add_of_ports br0 1 10
> +
> +AT_CHECK([ovs-appctl vlog/set ofproto_dpif:file:dbg vconn:file:info])
> +
> +AT_DATA([stddev.awk], [
> +  {
> +    # $1 (target) is a mean value, because all weights are the same.
> +    # $2 (hits) is an actual number of hashes assigned to this bucket.
> +    n_hashes += $2
> +    n_buckets++
> +    sum_sq_diff += ($2 - $1) * ($2 - $1)
> +  }
> +  END {
> +    mean = n_hashes / n_buckets
> +    stddev = sqrt(sum_sq_diff / n_buckets)
> +    stddevp = stddev * 100 / mean
> +
> +    print "hashes:", n_hashes, "buckets:", n_buckets
> +    print "mean:", mean, "stddev:", stddev, "(", stddevp, "% )"
> +
> +    # Make sure that standard deviation of load between buckets is below 12.5%.
> +    # Note: it's not a strict requirement, but a good number that passes tests.
> +    if (stddevp <= 12.5) { print "PASS" }
> +    else { print "FAIL" }
> +  }
> +])
> +
> +m4_define([CHECK_DISTRIBUTION], [
> +  AT_CHECK([tail -n $1 ovs-vswitchd.log | grep 'ofproto_dpif|DBG|.*Bucket' \
> +                | sed 's/.*target=\([[0-9\.]]*\) hits=\([[0-9]]*\)/\1 \2/' \
> +                | awk -f stddev.awk], [0], [stdout])
> +  AT_CHECK([grep -q "buckets: $2" stdout])
> +  AT_CHECK([grep -q 'PASS' stdout])
> +])
> +
> +m4_define([OF_GROUP], [group_id=$1,type=select,selection_method=dp_hash])
> +m4_define([OFG_BUCKET], [bucket=weight=$1,output:10])
> +
> +dnl Test load distribution in groups with up to 64 equally weighted buckets.
> +m4_define([OFG_BUCKETS], [OFG_BUCKET(100)])
> +m4_for([id], [1], [64], [1], [
> +  get_log_next_line_num
> +  AT_CHECK([ovs-ofctl -O OpenFlow15 add-group br0 \
> +                "OF_GROUP(id),OFG_BUCKETS()"])
> +  CHECK_DISTRIBUTION([+$LINENUM], [id])
> +  m4_append([OFG_BUCKETS], [,OFG_BUCKET(100)])
> +])
> +
> +OVS_VSWITCHD_STOP
> +AT_CLEANUP
> +
>  AT_SETUP([ofproto-dpif - select group with explicit dp_hash selection method])
>  
>  OVS_VSWITCHD_START

