Ilya Maximets <[email protected]> writes: > ovn-kubernetes users observe uneven distribution of traffic among > service backends. It gets noticeably worse when 9+ backends are > configured. For example: > > Server connections > server-6nbg7 634 > server-72br8 326 <----- > server-7b8rm 605 > server-9tnz7 655 > server-c84hw 596 > server-mjskl 622 > server-ptkcc 336 <----- > server-rl7pk 592 > server-xd2cb 634 > > Here we can see that out of 5000 connections total, servers '72br8' > and 'ptkcc' received about 2 times less connections than any other > server. With 10 backends, 4 servers will receive 2x less, etc. > The situation is common, because kubernetes users frequently scale > their applications with something like kubectl scale --replicas=10. > > Services are implemented as OVN load-balancers, which are implemented > as OpenFlow select groups with a dp_hash selection method. Balancing > is implemented by allocating a hash space (number of hashes is a > power of 2) and distributing hashes between the group buckets using > Webster's method taking into account bucket weights. > > This works well enough when buckets have different weights, because > OVS needs to allocate larger hash space in order to accommodate the > weight difference. However, in case of OVN load-balancers, all the > buckets (backends) have the same weight. This leads to allocation > of a smaller hash space just enough to fit all the backends (closest > power of 2). For example, if we have 10 backends (buckets), then > 16 hashes will be allocated. After hashes are distributed, we end > up with 6 buckets having 2 hashes each and 4 buckets having 1 hash. > So, each of the 6 buckets will receive on average 2x more traffic > than each of the remaining 4. > > The problem can be mitigated by expanding the hash space and this is > already done for cases with small number of buckets by limiting the > hash space to have at least 16 hashes. However, it is just not large > enough for 9 or 10 buckets to provide a good distribution. At the > same time, blindly increasing the limit is also not a good solution > as we do not want too large of a hash space for small number of > buckets, simply because each hash is a separate datapath flow and > having too many of them may cause performance issues. > > Approach taken in this change is to ensure that the hash space is > at least 4 times larger than the number of buckets, but not larger > than the maximum allowed (256). This provides a better distribution > while not unnecessarily exploding number of datapath flows for > services with not that many backends. > > Here is some data to demonstrate why the 4 was chosen as a coefficient: > > coeff. : 1 2 3 4 5 100 > ------------------------------------------------------------------- > AvgDiff : 43.1 % 27.1 % 18.3 % 15.1 % 13.6 % 10.9 % > MaxDiff : 50.0 % 33.3 % 25.0 % 20.0 % 20.0 % 20.0 % > AvgDev : 24.9 % 13.4 % 8.5 % 6.9 % 6.1 % 4.8 % > MaxDev : 35.4 % 20.4 % 14.4 % 11.2 % 11.2 % 11.2 % > --------+---------------------------------------------------------- > 16 : 1 1 1 1 1 - > 32 : 17 9 6 5 4 - > 64 : 33 17 11 9 7 - > 128 : 65 33 22 17 13 1 > 256 : 129 65 43 33 26 2 > --------+---------------------------------------------------------- > current proposed > > Table shows average and maximum load difference (Diff) between backends > across groups with 1 to 64 equally weighted backends. And it shows > average and maximum standard deviation (Dev) of load distribution for > the same. 
> For example, with a coefficient of 2, the maximum difference
> between two backends will be 33% and the maximum standard deviation
> will be 20.4%. With the current logic (coefficient of 1) we have
> a maximum difference as high as 50%, as shown by the example at the
> beginning, with a standard deviation of up to 35.4%.
>
> The bottom half of the table shows from how many backends we start to
> use a particular number of hashes. For example, with a coeff. of 3
> we will have 16 hashes for 1 to 5 buckets, 32 hashes for 6-10,
> 64 hashes for 11-21 and so on.
>
> According to the table, the number 4 is about where we achieve a good
> enough standard deviation for the load (11.2%) while still not
> creating too many hashes for cases with a low number of backends.
> The standard deviation also doesn't go down that much with higher
> coefficients.
>
> The data is aggregated for up to 64 backends because with higher
> numbers we're getting close to the cap of 256 hashes, so the deviation
> increases. However, the load difference with so many backends should
> have a lower impact on a cluster in general. The change improves the
> cases with 64+ backends, but to get very good numbers we'll need to
> consider moving the upper limit for the hash space, which is a
> separate change to think about.
>
> The added test checks that the standard deviation of the load
> distribution between buckets is below 12.5% for all cases up to
> 64 buckets. It's just a pretty number that I've chosen that should
> be IMO good enough.
>
> Note: when the number of buckets is a power of 2, we have an ideal
> distribution with zero deviation, because hashes can be evenly mapped
> to buckets.
>
> Note 2: The change doesn't affect distribution for 4 or fewer
> backends, since there is still a minimum of 16 hashes for those
> groups.
>
> Signed-off-by: Ilya Maximets <[email protected]>
> ---
>
> Not sure if this should be considered as a bug fix and be backported.
> Thoughts on this are welcome.

I think it should be okay to backport. It isn't a new feature.

Acked-by: Aaron Conole <[email protected]>
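
For the record, plugging the 10-backend example from the commit message
into the new sizing (my arithmetic, assuming connections hash
uniformly): today such a group gets 16 hashes, i.e. 6 buckets with 2
hashes and 4 buckets with 1, which is up to a 50% difference and a
standard deviation of roughly 31%. With the patch it gets 64 hashes
(10 * 4 = 40, rounded up to the next power of 2), i.e. 4 buckets with
7 hashes and 6 buckets with 6, so the worst-case difference drops to
about 14% and the deviation to under 8%, at the cost of going from 16
to 64 dp_hash values (and thus potential datapath flows) for that
group.
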
>  ofproto/ofproto-dpif.c |  6 +++--
>  tests/ofproto-dpif.at  | 54 ++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 58 insertions(+), 2 deletions(-)
>
> diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
> index 4659d8a3b..bf43d5d4b 100644
> --- a/ofproto/ofproto-dpif.c
> +++ b/ofproto/ofproto-dpif.c
> @@ -5341,8 +5341,10 @@ group_setup_dp_hash_table(struct group_dpif *group, size_t max_hash)
>               min_weight, total_weight);
>
>      uint64_t min_slots = DIV_ROUND_UP(total_weight, min_weight);
> -    uint64_t min_slots2 = ROUND_UP_POW2(min_slots);
> -    uint64_t n_hash = MAX(16, min_slots2);
> +    uint64_t min_slots2 =
> +        MAX(min_slots, MIN(n_buckets * 4, MAX_SELECT_GROUP_HASH_VALUES));
> +    uint64_t min_slots3 = ROUND_UP_POW2(min_slots2);
> +    uint64_t n_hash = MAX(16, min_slots3);
>      if (n_hash > MAX_SELECT_GROUP_HASH_VALUES ||
>          (max_hash != 0 && n_hash > max_hash)) {
>          VLOG_DBG("  Too many hash values required: %"PRIu64, n_hash);
> diff --git a/tests/ofproto-dpif.at b/tests/ofproto-dpif.at
> index 12cb7f7a6..1c99dc3bb 100644
> --- a/tests/ofproto-dpif.at
> +++ b/tests/ofproto-dpif.at
> @@ -1238,6 +1238,60 @@ bucket3 >= 500
>  OVS_VSWITCHD_STOP
>  AT_CLEANUP
>
> +AT_SETUP([ofproto-dpif - select group with dp_hash and equal weights])
> +
> +OVS_VSWITCHD_START
> +add_of_ports br0 1 10
> +
> +AT_CHECK([ovs-appctl vlog/set ofproto_dpif:file:dbg vconn:file:info])
> +
> +AT_DATA([stddev.awk], [
> +  {
> +    # $1 (target) is a mean value, because all weights are the same.
> +    # $2 (hits) is an actual number of hashes assigned to this bucket.
> +    n_hashes += $2
> +    n_buckets++
> +    sum_sq_diff += ($2 - $1) * ($2 - $1)
> +  }
> +  END {
> +    mean = n_hashes / n_buckets
> +    stddev = sqrt(sum_sq_diff / n_buckets)
> +    stddevp = stddev * 100 / mean
> +
> +    print "hashes:", n_hashes, "buckets:", n_buckets
> +    print "mean:", mean, "stddev:", stddev, "(", stddevp, "% )"
> +
> +    # Make sure that standard deviation of load between buckets is below 12.5%.
> +    # Note: it's not a strict requirement, but a good number that passes tests.
> +    if (stddevp <= 12.5) { print "PASS" }
> +    else { print "FAIL" }
> +  }
> +])
> +
> +m4_define([CHECK_DISTRIBUTION], [
> +  AT_CHECK([tail -n $1 ovs-vswitchd.log | grep 'ofproto_dpif|DBG|.*Bucket' \
> +              | sed 's/.*target=\([[0-9\.]]*\) hits=\([[0-9]]*\)/\1 \2/' \
> +              | awk -f stddev.awk], [0], [stdout])
> +  AT_CHECK([grep -q "buckets: $2" stdout])
> +  AT_CHECK([grep -q 'PASS' stdout])
> +])
> +
> +m4_define([OF_GROUP], [group_id=$1,type=select,selection_method=dp_hash])
> +m4_define([OFG_BUCKET], [bucket=weight=$1,output:10])
> +
> +dnl Test load distribution in groups with up to 64 equally weighted buckets.
> +m4_define([OFG_BUCKETS], [OFG_BUCKET(100)])
> +m4_for([id], [1], [64], [1], [
> +  get_log_next_line_num
> +  AT_CHECK([ovs-ofctl -O OpenFlow15 add-group br0 \
> +              "OF_GROUP(id),OFG_BUCKETS()"])
> +  CHECK_DISTRIBUTION([+$LINENUM], [id])
> +  m4_append([OFG_BUCKETS], [,OFG_BUCKET(100)])
> +])
> +
> +OVS_VSWITCHD_STOP
> +AT_CLEANUP
> +
>  AT_SETUP([ofproto-dpif - select group with explicit dp_hash selection method])
>
>  OVS_VSWITCHD_START
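
One more note, mostly for anyone who wants to eyeball the new
distribution outside of the testsuite: the same debug output the test
parses can be checked by hand on a scratch bridge. Roughly (commands
mirror the test above; adjust the bridge/port names and the location of
ovs-vswitchd.log for your setup):

  ovs-appctl vlog/set ofproto_dpif:file:dbg
  ovs-ofctl -O OpenFlow15 add-group br0 \
      'group_id=1,type=select,selection_method=dp_hash,bucket=weight=100,output:10,bucket=weight=100,output:10,bucket=weight=100,output:10'
  grep 'Bucket.*target=' ovs-vswitchd.log

The 'target='/'hits=' numbers in those log lines are what stddev.awk
above consumes.
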
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev