On 2/8/21 11:09 AM, Numan Siddique wrote:
On Sat, Feb 6, 2021 at 4:00 AM Dumitru Ceara <[email protected]> wrote:

The main trait of load balancer hairpin traffic is that it never leaves
the local hypervisor.  Essentially this means that only hairpin
openflows installed for logical switches that have at least one logical
switch port bound locally can ever be hit.

Until now, if a load balancer was applied on multiple logical switches
that are connected through a distributed router, ovn-controller would
install flows to detect hairpin replies for all logical switches. In
practice this leads to a very high number of openflows out of which
most will never be used.

Instead we now use an additional action, learn(), on flows that match on
packets that create the hairpin session.  The learn() action will then
generate the necessary flows to handle hairpin replies, but only for
the local datapaths which actually generate hairpin traffic.

For example, simulating how ovn-k8s uses load balancer for services,
in a "switch per node" scenario, the script below would generate
10K (n_nodes * n_vips * n_backends) openflows on every node in table=69
(hairpin reply).  With this patch the maximum number of openflows that
can be created for hairpin replies is 200 (n_vips * n_backends).

In general, for deployments that leverage switch-per-node topologies,
the number of openflows is reduced by a factor of N, where N is the
number of nodes.

   $ cat lbs.sh
   NODES=50
   VIPS=20
   BACKENDS=10
   ovn-nbctl lr-add rtr
   for ((i = 1; i <= $NODES; i++)); do
       ovn-nbctl \
           -- ls-add ls$i \
           -- lsp-add ls$i vm$i \
           -- lsp-add ls$i ls$i-rtr \
           -- lsp-set-type ls$i-rtr router \
           -- lsp-set-options ls$i-rtr router-port=rtr-ls$i \
           -- lrp-add rtr rtr-ls$i 00:00:00:00:01:00 42.42.42.$i/24
   done

   for ((i = 1; i <= $VIPS; i++)); do
       lb=lb$i
       vip=10.10.10.$i:1
       bip=20.20.20.1:2
       for ((j = 2; j <= $BACKENDS; j++)); do
           bip="$bip,20.20.20.$j:2"
       done
       ovn-nbctl lb-add $lb $vip $backends
   done

   for ((i = 1; i <= $NODES; i++)); do
       for ((j = 1; j <= $VIPS; j++)); do
           ovn-nbctl ls-lb-add ls$i lb$j
       done
   done

   ovs-vsctl add-port br-int vm1 \
       -- set interface vm1 type=internal \
       -- set interface vm1 external-ids:iface-id=vm1

Suggested-by: Ilya Maximets <[email protected]>
Signed-off-by: Dumitru Ceara <[email protected]>

Hi Dumitru,

Hi Numan,


Thanks for this patch. The patch LGTM. I tested it out with  huge NB
DB with lots
of load balancer and backends.  I see a significant reduction in OF flows.

Thanks for the review!


Please see the few comments below.

Thanks
Numan

---
  controller/lflow.c      | 204 ++++++++++++++++---
  tests/ofproto-macros.at |   5 +-
  tests/ovn.at            | 516 +++++++++++++++++++++++++-----------------------
  3 files changed, 455 insertions(+), 270 deletions(-)

diff --git a/controller/lflow.c b/controller/lflow.c
index 946c1e0..2b7d356 100644
--- a/controller/lflow.c
+++ b/controller/lflow.c
@@ -1171,6 +1171,178 @@ add_neighbor_flows(struct ovsdb_idl_index 
*sbrec_port_binding_by_name,
      }
  }

+/* Builds the "learn()" action to be triggered by packets initiating a
+ * hairpin session.
+ *
+ * This will generate flows in table OFTABLE_CHK_LB_HAIRPIN_REPLY of the form:
+ * - match:
+ *     metadata=<orig-pkt-metadata>,ip/ipv6,ip.src=<backend>,ip.dst=<vip>
+ *     nw_proto='lb_proto',tp_src_port=<backend-port>
+ * - action:
+ *     set MLF_LOOKUP_LB_HAIRPIN_BIT=1
+ */
+static void
+add_lb_vip_hairpin_reply_action(struct in6_addr *vip6, ovs_be32 vip,
+                                uint8_t lb_proto, bool has_l4_port,
+                                uint64_t cookie, struct ofpbuf *ofpacts)
+{
+    struct match match = MATCH_CATCHALL_INITIALIZER;
+    struct ofpact_learn *ol = ofpact_put_LEARN(ofpacts);
+    struct ofpact_learn_spec *ol_spec;
+    unsigned int imm_bytes;
+    uint8_t *src_imm;
+
+    /* Once learned, hairpin reply flows are permanent until the VIP/backend
+     * is removed.
+     */
+    ol->flags = NX_LEARN_F_DELETE_LEARNED;
+    ol->idle_timeout = OFP_FLOW_PERMANENT;
+    ol->hard_timeout = OFP_FLOW_PERMANENT;
+    ol->priority = OFP_DEFAULT_PRIORITY;
+    ol->table_id = OFTABLE_CHK_LB_HAIRPIN_REPLY;
+    ol->cookie = htonll(cookie);
+
+    /* Match on metadata of the packet that created the hairpin session. */
+    ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec);
+
+    ol_spec->dst.field = mf_from_id(MFF_METADATA);
+    ol_spec->dst.ofs = 0;
+    ol_spec->dst.n_bits = ol_spec->dst.field->n_bits;
+    ol_spec->n_bits = ol_spec->dst.n_bits;
+    ol_spec->dst_type = NX_LEARN_DST_MATCH;
+    ol_spec->src_type = NX_LEARN_SRC_FIELD;
+    ol_spec->src.field = mf_from_id(MFF_METADATA);
+
+    /* Match on the same ETH type as the packet that created the hairpin
+     * session.
+     */
+    ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec);
+    ol_spec->dst.field = mf_from_id(MFF_ETH_TYPE);
+    ol_spec->dst.ofs = 0;
+    ol_spec->dst.n_bits = ol_spec->dst.field->n_bits;
+    ol_spec->n_bits = ol_spec->dst.n_bits;
+    ol_spec->dst_type = NX_LEARN_DST_MATCH;
+    ol_spec->src_type = NX_LEARN_SRC_IMMEDIATE;
+    union mf_value imm_eth_type = {
+        .be16 = !vip6 ? htons(ETH_TYPE_IP) : htons(ETH_TYPE_IPV6)
+    };
+    mf_write_subfield_value(&ol_spec->dst, &imm_eth_type, &match);
+
+    /* Push value last, as this may reallocate 'ol_spec'. */
+    imm_bytes = DIV_ROUND_UP(ol_spec->dst.n_bits, 8);
+    src_imm = ofpbuf_put_zeros(ofpacts, OFPACT_ALIGN(imm_bytes));
+    memcpy(src_imm, &imm_eth_type, imm_bytes);
+
+    /* Hairpin replies have ip.src == <backend-ip>. */
+    ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec);
+    if (!vip6) {
+        ol_spec->dst.field = mf_from_id(MFF_IPV4_SRC);
+        ol_spec->src.field = mf_from_id(MFF_IPV4_SRC);
+    } else {
+        ol_spec->dst.field = mf_from_id(MFF_IPV6_SRC);
+        ol_spec->src.field = mf_from_id(MFF_IPV6_SRC);
+    }
+    ol_spec->dst.ofs = 0;
+    ol_spec->dst.n_bits = ol_spec->dst.field->n_bits;
+    ol_spec->n_bits = ol_spec->dst.n_bits;
+    ol_spec->dst_type = NX_LEARN_DST_MATCH;
+    ol_spec->src_type = NX_LEARN_SRC_FIELD;
+
+    /* Hairpin replies have ip.dst == <vip>. */
+    union mf_value imm_ip;
+    ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec);
+    if (!vip6) {
+        ol_spec->dst.field = mf_from_id(MFF_IPV4_DST);
+        imm_ip = (union mf_value) {
+            .be32 = vip
+        };
+    } else {
+        ol_spec->dst.field = mf_from_id(MFF_IPV6_DST);
+        imm_ip = (union mf_value) {
+            .ipv6 = *vip6
+        };
+    }
+    ol_spec->dst.ofs = 0;
+    ol_spec->dst.n_bits = ol_spec->dst.field->n_bits;
+    ol_spec->n_bits = ol_spec->dst.n_bits;
+    ol_spec->dst_type = NX_LEARN_DST_MATCH;
+    ol_spec->src_type = NX_LEARN_SRC_IMMEDIATE;
+    mf_write_subfield_value(&ol_spec->dst, &imm_ip, &match);
+
+    /* Push value last, as this may reallocate 'ol_spec' */
+    imm_bytes = DIV_ROUND_UP(ol_spec->dst.n_bits, 8);
+    src_imm = ofpbuf_put_zeros(ofpacts, OFPACT_ALIGN(imm_bytes));
+    memcpy(src_imm, &imm_ip, imm_bytes);
+
+    /* Hairpin replies have the same nw_proto as packets that created the
+     * session.
+     */
+    union mf_value imm_proto = {
+        .u8 = lb_proto,
+    };
+    ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec);
+    ol_spec->dst.field = mf_from_id(MFF_IP_PROTO);
+    ol_spec->src.field = mf_from_id(MFF_IP_PROTO);
+    ol_spec->dst.ofs = 0;
+    ol_spec->dst.n_bits = ol_spec->dst.field->n_bits;
+    ol_spec->n_bits = ol_spec->dst.n_bits;
+    ol_spec->dst_type = NX_LEARN_DST_MATCH;
+    ol_spec->src_type = NX_LEARN_SRC_IMMEDIATE;
+    mf_write_subfield_value(&ol_spec->dst, &imm_proto, &match);
+
+    /* Push value last, as this may reallocate 'ol_spec' */
+    imm_bytes = DIV_ROUND_UP(ol_spec->dst.n_bits, 8);
+    src_imm = ofpbuf_put_zeros(ofpacts, OFPACT_ALIGN(imm_bytes));
+    memcpy(src_imm, &imm_proto, imm_bytes);
+
+    /* Hairpin replies have source port == <backend-port>. */
+    if (has_l4_port) {
+        ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec);
+        switch (lb_proto) {
+        case IPPROTO_TCP:
+            ol_spec->dst.field = mf_from_id(MFF_TCP_SRC);
+            ol_spec->src.field = mf_from_id(MFF_TCP_DST);
+            break;
+        case IPPROTO_UDP:
+            ol_spec->dst.field = mf_from_id(MFF_UDP_SRC);
+            ol_spec->src.field = mf_from_id(MFF_UDP_DST);
+            break;
+        case IPPROTO_SCTP:
+            ol_spec->dst.field = mf_from_id(MFF_SCTP_SRC);
+            ol_spec->src.field = mf_from_id(MFF_SCTP_DST);
+            break;
+        default:
+            OVS_NOT_REACHED();
+            break;
+        }
+        ol_spec->dst.ofs = 0;
+        ol_spec->dst.n_bits = ol_spec->dst.field->n_bits;
+        ol_spec->n_bits = ol_spec->dst.n_bits;
+        ol_spec->dst_type = NX_LEARN_DST_MATCH;
+        ol_spec->src_type = NX_LEARN_SRC_FIELD;
+    }
+
+    /* Set MLF_LOOKUP_LB_HAIRPIN_BIT for hairpin replies. */
+    ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec);
+    ol_spec->dst.field = mf_from_id(MFF_LOG_FLAGS);
+    ol_spec->dst.ofs = MLF_LOOKUP_LB_HAIRPIN_BIT;
+    ol_spec->dst.n_bits = 1;
+    ol_spec->n_bits = ol_spec->dst.n_bits;
+    ol_spec->dst_type = NX_LEARN_DST_LOAD;
+    ol_spec->src_type = NX_LEARN_SRC_IMMEDIATE;
+    union mf_value imm_reg_value = {
+        .u8 = 1
+    };
+    mf_write_subfield_value(&ol_spec->dst, &imm_reg_value, &match);
+
+    /* Push value last, as this may reallocate 'ol_spec' */
+    imm_bytes = DIV_ROUND_UP(ol_spec->dst.n_bits, 8);
+    src_imm = ofpbuf_put_zeros(ofpacts, OFPACT_ALIGN(imm_bytes));
+    memcpy(src_imm, &imm_reg_value, imm_bytes);
+
+    ofpact_finish_LEARN(ofpacts, &ol);

The above code makes perfect sense. But I think we can reduce the amount of
code by making use of the 'learn_parse()' function present in
lib/learn.c of OVS.

This would also make the code much more simpler. Otherwise the reader
has to understand all the intricacies of adding 'learn' ofpact.

I tried something like below to test it out and It worked fine to me

*****************
struct ds learn = DS_EMPTY_INITIALIZER;
ds_put_format(&learn, "table=%d,delete_learned,OXM_OF_METADATA[],"

"eth_type=0x800,nw_proto=6,NXM_OF_IP_SRC[],ip_dst="IP_FMT","

"NXM_OF_TCP_SRC[]=NXM_OF_TCP_DST[],load:0x1->NXM_NX_REG10[7]",
                         OFTABLE_CHK_LB_HAIRPIN_REPLY, IP_ARGS(vip));
char *error = learn_parse(ds_cstr(&learn), NULL, NULL, &ofpbuf_learn);
if (!error) {
ofpbuf_put(ofpacts, ofpbuf_learn.data, ofpbuf_learn.size);
}
...
..

******************

What do you think ?

To be honest, I'm not so sure about this. It would create a dependency on how OVS formats openflows internally.

I think we can argue the same thing about intricacies of adding other OVS actions, e.g.:

https://github.com/ovn-org/ovn/blob/44ea2ec88136f83e7eab9790473025b6c95bdcc0/controller/lflow.c#L1267

https://github.com/ovn-org/ovn/blob/44ea2ec88136f83e7eab9790473025b6c95bdcc0/lib/actions.c#L422

We could change those too and use OVS parsing internal utilities to parse strings but we don't and, in my opinion, the proper way to go is to configure the actions programatically by using the data structures in include/openvswitch/ofp-actions.h

If it helps I can try to refactor a bit the code that creates the learn() action for hairpin although I'm not sure at the moment how to do that.


Thanks
Numan

Regards,
Dumitru

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to