On 08.01.2020 06:18, Vishal Deep Ajmera wrote: > Problem: > -------- > In OVS-DPDK, flows with output over a bond interface of type “balance-tcp” > (using a hash on TCP/UDP 5-tuple) get translated by the ofproto layer into > "HASH" and "RECIRC" datapath actions. After recirculation, the packet is > forwarded to the bond member port based on 8-bits of the datapath hash > value computed through dp_hash. This causes performance degradation in the > following ways: > > 1. The recirculation of the packet implies another lookup of the packet’s > flow key in the exact match cache (EMC) and potentially Megaflow classifier > (DPCLS). This is the biggest cost factor. > > 2. The recirculated packets have a new “RSS” hash and compete with the > original packets for the scarce number of EMC slots. This implies more > EMC misses and potentially EMC thrashing causing costly DPCLS lookups. > > 3. The 256 extra megaflow entries per bond for dp_hash bond selection put > additional load on the revalidation threads. > > Owing to this performance degradation, deployments stick to “balance-slb” > bond mode even though it does not do active-active load balancing for > VXLAN- and GRE-tunnelled traffic because all tunnel packet have the same > source MAC address. > > Proposed optimization: > ---------------------- > This proposal introduces a new load-balancing output action instead of > recirculation. > > Maintain one table per-bond (could just be an array of uint16's) and > program it the same way internal flows are created today for each possible > hash value(256 entries) from ofproto layer. Use this table to load-balance > flows as part of output action processing. > > Currently xlate_normal() -> output_normal() -> bond_update_post_recirc_rules() > -> bond_may_recirc() and compose_output_action__() generate > “dp_hash(hash_l4(0))” and “recirc(<RecircID>)” actions. In this case the > RecircID identifies the bond. For the recirculated packets the ofproto layer > installs megaflow entries that match on RecircID and masked dp_hash and send > them to the corresponding output port. > > Instead, we will now generate action as > "lb_output(bond,<bond id>)" > > This combines hash computation (only if needed, else re-use RSS hash) and > inline load-balancing over the bond. This action is used *only* for > balance-tcp > bonds in OVS-DPDK datapath (the OVS kernel datapath remains unchanged). > > Example: > -------- > Current scheme: > --------------- > With 1 IP-UDP flow: > > flow-dump from pmd on cpu core: 2 > recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), > packets:2828969, bytes:181054016, used:0.000s, > actions:hash(hash_l4(0)),recirc(0x1) > > recirc_id(0x1),dp_hash(0x113683bd/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), > packets:2828937, bytes:181051968, used:0.000s, actions:2 > > With 8 IP-UDP flows (with random UDP src port): (all hitting same DPCL): > > flow-dump from pmd on cpu core: 2 > recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), > packets:2674009, bytes:171136576, used:0.000s, > actions:hash(hash_l4(0)),recirc(0x1) > > recirc_id(0x1),dp_hash(0xf8e02b7e/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), > packets:377395, bytes:24153280, used:0.000s, actions:2 > recirc_id(0x1),dp_hash(0xb236c260/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), > packets:333486, bytes:21343104, used:0.000s, actions:1 > recirc_id(0x1),dp_hash(0x7d89eb18/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), > packets:348461, bytes:22301504, used:0.000s, actions:1 > recirc_id(0x1),dp_hash(0xa78d75df/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), > packets:633353, bytes:40534592, used:0.000s, actions:2 > recirc_id(0x1),dp_hash(0xb58d846f/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), > packets:319901, bytes:20473664, used:0.001s, actions:2 > recirc_id(0x1),dp_hash(0x24534406/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), > packets:334985, bytes:21439040, used:0.001s, actions:1 > recirc_id(0x1),dp_hash(0x3cf32550/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), > packets:326404, bytes:20889856, used:0.001s, actions:1 > > New scheme: > ----------- > We can do with a single flow entry (for any number of new flows): > > in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), > packets:2674009, bytes:171136576, used:0.000s, actions:lb_output(bond,1) > > A new CLI has been added to dump datapath bond cache as given below. > > “sudo ovs-appctl dpif-netdev/dp-bond-show [dp]” > > root@ubuntu-190:performance_scripts # sudo ovs-appctl dpif-netdev/dp-bond-show > Bond cache: > bond-id 1 : > bucket 0 - slave 2 > bucket 1 - slave 1 > bucket 2 - slave 2 > bucket 3 - slave 1 > > Performance improvement: > ------------------------ > With a prototype of the proposed idea, the following perf improvement is seen > with Phy-VM-Phy UDP traffic, single flow. With multiple flows, the improvement > is even more enhanced (due to reduced number of flows). > > 1 VM: > ***** > +--------------------------------------+ > | mpps | > +--------------------------------------+ > | Flows master with-opt. %delta | > +--------------------------------------+ > | 1 4.47 5.73 28.21 > | 10 4.17 5.35 28.45 > | 400 3.52 5.39 53.01 > | 1k 3.41 5.25 53.78 > | 10k 2.53 4.57 80.44 > | 100k 2.32 4.27 83.59 > | 500k 2.32 4.27 83.59 > +--------------------------------------+ > mpps: million packets per second. > packet size 64 bytes. > > Signed-off-by: Manohar Krishnappa Chidambaraswamy <man...@gmail.com> > Co-authored-by: Manohar Krishnappa Chidambaraswamy <man...@gmail.com> > Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajm...@ericsson.com> > > CC: Jan Scheurich <jan.scheur...@ericsson.com> > CC: Venkatesan Pradeep <venkatesan.prad...@ericsson.com> > CC: Nitin Katiyar <nitin.kati...@ericsson.com> > --- > datapath/linux/compat/include/linux/openvswitch.h | 1 + > lib/dpif-netdev.c | 502 > ++++++++++++++++++++-- > lib/dpif-netlink.c | 3 + > lib/dpif-provider.h | 9 + > lib/dpif.c | 49 +++ > lib/dpif.h | 7 + > lib/odp-execute.c | 2 + > lib/odp-util.c | 4 + > ofproto/bond.c | 56 ++- > ofproto/bond.h | 9 + > ofproto/ofproto-dpif-ipfix.c | 1 + > ofproto/ofproto-dpif-sflow.c | 1 + > ofproto/ofproto-dpif-xlate.c | 42 +- > ofproto/ofproto-dpif.c | 24 ++ > ofproto/ofproto-dpif.h | 9 +- > tests/lacp.at | 9 + > vswitchd/bridge.c | 6 + > vswitchd/vswitch.xml | 23 + > 18 files changed, 695 insertions(+), 62 deletions(-)
I started reviewing, but stopped at some point because I realized that current solution is over-complicated (too much of unnecessary memory copies) and not thread safe. See some comments inline. I'll follow up with suggestions on how to implement this functionality in a different way, since I don't think that it's possible to make current version work correctly. NACK for dpif-netdev part. Best regards, Ilya Maximets. > > diff --git a/datapath/linux/compat/include/linux/openvswitch.h > b/datapath/linux/compat/include/linux/openvswitch.h > index 2f0c655..7604a2a 100644 > --- a/datapath/linux/compat/include/linux/openvswitch.h > +++ b/datapath/linux/compat/include/linux/openvswitch.h > @@ -1021,6 +1021,7 @@ enum ovs_action_attr { > OVS_ACTION_ATTR_TUNNEL_PUSH, /* struct ovs_action_push_tnl*/ > OVS_ACTION_ATTR_TUNNEL_POP, /* u32 port number. */ > OVS_ACTION_ATTR_DROP, /* u32 xlate_error. */ > + OVS_ACTION_ATTR_LB_OUTPUT, /* u32 bond-id */ Please, add the period at the end of comment. > #endif > __OVS_ACTION_ATTR_MAX, /* Nothing past this will be accepted > * from userspace. */ > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c > index 2421821..78f9cf9 100644 > --- a/lib/dpif-netdev.c > +++ b/lib/dpif-netdev.c > @@ -79,6 +79,7 @@ > #include "unixctl.h" > #include "util.h" > #include "uuid.h" > +#include "ofproto/bond.h" This doesn't look good. Please, avoid including 'ofproto' headers from the 'lib' code. > > VLOG_DEFINE_THIS_MODULE(dpif_netdev); > > @@ -377,6 +378,11 @@ struct dp_netdev { > > struct conntrack *conntrack; > struct pmd_auto_lb pmd_alb; As Ben suggested, empty line here. > + /* Bonds. > + * > + * Any lookup into 'bonds' requires taking 'bond_mutex'. */ > + struct ovs_mutex bond_mutex; > + struct hmap bonds; As Ben suggested, comment here is required. > }; > > static void meter_lock(const struct dp_netdev *dp, uint32_t meter_id) > @@ -607,6 +613,20 @@ struct tx_port { > struct dp_netdev_rxq *output_pkts_rxqs[NETDEV_MAX_BURST]; > }; > > +/* Contained by struct tx_bond 'slave_buckets' */ > +struct slave_entry { > + odp_port_t slave_id; > + atomic_ullong n_packets; > + atomic_ullong n_bytes; > +}; > + > +/* Contained by struct dp_netdev_pmd_thread's 'bond_cache' or 'tx_bonds'. */ > +struct tx_bond { > + struct hmap_node node; > + uint32_t bond_id; > + struct slave_entry slave_buckets[BOND_BUCKETS]; > +}; > + > /* A set of properties for the current processing loop that is not directly > * associated with the pmd thread itself, but with the packets being > * processed or the short-term system configuration (for example, time). > @@ -718,7 +738,6 @@ struct dp_netdev_pmd_thread { > atomic_bool wait_for_reload; /* Can we busy wait for the next reload? > */ > atomic_bool reload_tx_qid; /* Do we need to reload static_tx_qid? */ > atomic_bool exit; /* For terminating the pmd thread. */ > - Unrelated change. Please, keep the line. > pthread_t thread; > unsigned core_id; /* CPU core id of this pmd thread. */ > int numa_id; /* numa node id of this pmd thread. */ > @@ -739,6 +758,11 @@ struct dp_netdev_pmd_thread { > * read by the pmd thread. */ > struct hmap tx_ports OVS_GUARDED; > > + struct ovs_mutex bond_mutex; /* Mutex for 'tx_bonds'. */ > + /* Map of 'tx_bond's used for transmission. Written by the main thread, > + * read/written by the pmd thread. */ Why PMD needs to write into this hash map? Can we avoid this? Sounds dangerous. > + struct hmap tx_bonds OVS_GUARDED; > + > /* These are thread-local copies of 'tx_ports'. One contains only tunnel > * ports (that support push_tunnel/pop_tunnel), the other contains ports > * with at least one txq (that support send). A port can be in both. > @@ -751,6 +775,8 @@ struct dp_netdev_pmd_thread { > * other instance will only be accessed by its own pmd thread. */ > struct hmap tnl_port_cache; > struct hmap send_port_cache; > + /* These are thread-local copies of 'tx_bonds' */ > + struct hmap bond_cache; > > /* Keep track of detailed PMD performance statistics. */ > struct pmd_perf_stats perf_stats; > @@ -830,6 +856,12 @@ static void dp_netdev_del_rxq_from_pmd(struct > dp_netdev_pmd_thread *pmd, > static int > dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *pmd, > bool force); > +static void dp_netdev_add_bond_tx_to_pmd(struct dp_netdev_pmd_thread *pmd, > + struct tx_bond *bond) > + OVS_REQUIRES(pmd->bond_mutex); > +static void dp_netdev_del_bond_tx_from_pmd(struct dp_netdev_pmd_thread *pmd, > + struct tx_bond *tx) > + OVS_REQUIRES(pmd->bond_mutex); > > static void reconfigure_datapath(struct dp_netdev *dp) > OVS_REQUIRES(dp->port_mutex); > @@ -838,6 +870,10 @@ static void dp_netdev_pmd_unref(struct > dp_netdev_pmd_thread *pmd); > static void dp_netdev_pmd_flow_flush(struct dp_netdev_pmd_thread *pmd); > static void pmd_load_cached_ports(struct dp_netdev_pmd_thread *pmd) > OVS_REQUIRES(pmd->port_mutex); > +static void pmd_load_cached_bonds(struct dp_netdev_pmd_thread *pmd) > + OVS_REQUIRES(pmd->bond_mutex); > +static void pmd_load_bond_cache(struct dp_netdev_pmd_thread *pmd); > + > static inline void > dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd, > struct polled_queue *poll_list, int poll_cnt); > @@ -1396,6 +1432,45 @@ pmd_perf_show_cmd(struct unixctl_conn *conn, int argc, > par.command_type = PMD_INFO_PERF_SHOW; > dpif_netdev_pmd_info(conn, argc, argv, &par); > } > + > +static void > +dpif_netdev_dp_bond_show(struct unixctl_conn *conn, int argc, > + const char *argv[], void *aux OVS_UNUSED) > +{ > + struct ds reply = DS_EMPTY_INITIALIZER; > + > + ovs_mutex_lock(&dp_netdev_mutex); > + > + struct dp_netdev *dp = NULL; > + if (argc == 2) { > + dp = shash_find_data(&dp_netdevs, argv[1]); > + } else if (shash_count(&dp_netdevs) == 1) { > + /* There's only one datapath */ > + dp = shash_first(&dp_netdevs)->data; > + } > + if (!dp) { > + ovs_mutex_unlock(&dp_netdev_mutex); > + unixctl_command_reply_error(conn, > + "please specify an existing datapath"); > + return; > + } > + ds_put_cstr(&reply, "\nBonds:\n"); > + > + struct tx_bond *dp_bond_entry; > + HMAP_FOR_EACH (dp_bond_entry, node, &dp->bonds) { This requires holding dp->bond_mutex. > + ds_put_format(&reply, "\tbond-id %u :\n", > + dp_bond_entry->bond_id); > + for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) { > + ds_put_format(&reply, "\t\tbucket %u - slave %u \n", > + bucket, > + dp_bond_entry->slave_buckets[bucket].slave_id); > + } > + } > + ovs_mutex_unlock(&dp_netdev_mutex); > + unixctl_command_reply(conn, ds_cstr(&reply)); > + ds_destroy(&reply); > +} > + > > static int > dpif_netdev_init(void) > @@ -1427,6 +1502,9 @@ dpif_netdev_init(void) > "[-us usec] [-q qlen]", > 0, 10, pmd_perf_log_set_cmd, > NULL); > + unixctl_command_register("dpif-netdev/dp-bond-show", "[dp]", > + 0, 1, dpif_netdev_dp_bond_show, > + NULL); > return 0; > } > > @@ -1551,6 +1629,9 @@ create_dp_netdev(const char *name, const struct > dpif_class *class, > ovs_mutex_init(&dp->port_mutex); > hmap_init(&dp->ports); > dp->port_seq = seq_create(); > + ovs_mutex_init(&dp->bond_mutex); > + hmap_init(&dp->bonds); > + > fat_rwlock_init(&dp->upcall_rwlock); > > dp->reconfigure_seq = seq_create(); > @@ -1674,6 +1755,14 @@ dp_netdev_free(struct dp_netdev *dp) > } > ovs_mutex_unlock(&dp->port_mutex); > > + ovs_mutex_lock(&dp->bond_mutex); > + struct tx_bond *bond, *next_bond; > + HMAP_FOR_EACH_SAFE (bond, next_bond, node, &dp->bonds) { > + hmap_remove(&dp->bonds, &bond->node); > + free(bond); > + } > + ovs_mutex_unlock(&dp->bond_mutex); > + > dp_netdev_destroy_all_pmds(dp, true); > cmap_destroy(&dp->poll_threads); > > @@ -1692,6 +1781,9 @@ dp_netdev_free(struct dp_netdev *dp) > hmap_destroy(&dp->ports); > ovs_mutex_destroy(&dp->port_mutex); > > + hmap_destroy(&dp->bonds); > + ovs_mutex_destroy(&dp->bond_mutex); > + > /* Upcalls must be disabled at this point */ > dp_netdev_destroy_upcall_lock(dp); > > @@ -1795,6 +1887,7 @@ dp_netdev_reload_pmd__(struct dp_netdev_pmd_thread *pmd) > ovs_mutex_lock(&pmd->port_mutex); > pmd_load_cached_ports(pmd); > ovs_mutex_unlock(&pmd->port_mutex); > + pmd_load_bond_cache(pmd); > ovs_mutex_unlock(&pmd->dp->non_pmd_mutex); > return; > } > @@ -1809,6 +1902,12 @@ hash_port_no(odp_port_t port_no) > return hash_int(odp_to_u32(port_no), 0); > } > > +static uint32_t > +hash_bond_id(uint32_t bond_id) > +{ > + return hash_int(bond_id, 0); > +} > + > static int > port_create(const char *devname, const char *type, > odp_port_t port_no, struct dp_netdev_port **portp) > @@ -4349,6 +4448,19 @@ tx_port_lookup(const struct hmap *hmap, odp_port_t > port_no) > return NULL; > } > > +static struct tx_bond * > +tx_bond_lookup(const struct hmap *hmap, uint32_t bond_id) > +{ > + struct tx_bond *tx; > + > + HMAP_FOR_EACH_IN_BUCKET (tx, node, hash_bond_id(bond_id), hmap) { Why not HMAP_FOR_EACH_WITH_HASH ? BTW, it's better to calculate hash before the loop to be sure that we're not recalculating it on each iteration. > + if (tx->bond_id == bond_id) { > + return tx; > + } > + } > + return NULL; > +} > + > static int > port_reconfigure(struct dp_netdev_port *port) > { > @@ -4837,6 +4949,26 @@ pmd_remove_stale_ports(struct dp_netdev *dp, > ovs_mutex_unlock(&pmd->port_mutex); > } > > +static void > +pmd_remove_stale_bonds(struct dp_netdev *dp, > + struct dp_netdev_pmd_thread *pmd) > + OVS_EXCLUDED(pmd->bond_mutex) > + OVS_EXCLUDED(dp->bond_mutex) > +{ > + ovs_mutex_lock(&dp->bond_mutex); > + ovs_mutex_lock(&pmd->bond_mutex); > + > + struct tx_bond *tx, *tx_next; > + HMAP_FOR_EACH_SAFE (tx, tx_next, node, &pmd->tx_bonds) { > + if (!tx_bond_lookup(&dp->bonds, tx->bond_id)) { > + dp_netdev_del_bond_tx_from_pmd(pmd, tx); > + } > + } > + > + ovs_mutex_unlock(&pmd->bond_mutex); > + ovs_mutex_unlock(&dp->bond_mutex); > +} > + > /* Must be called each time a port is added/removed or the cmask changes. > * This creates and destroys pmd threads, reconfigures ports, opens their > * rxqs and assigns all rxqs/txqs to pmd threads. */ > @@ -4875,10 +5007,11 @@ reconfigure_datapath(struct dp_netdev *dp) > } > } > > - /* Remove from the pmd threads all the ports that have been deleted or > - * need reconfiguration. */ > + /* Remove from the pmd threads all the ports/bonds that have been deleted > + * or need reconfiguration. */ > CMAP_FOR_EACH (pmd, node, &dp->poll_threads) { > pmd_remove_stale_ports(dp, pmd); > + pmd_remove_stale_bonds(dp, pmd); > } > > /* Reload affected pmd threads. We must wait for the pmd threads before > @@ -5000,6 +5133,21 @@ reconfigure_datapath(struct dp_netdev *dp) > ovs_mutex_unlock(&pmd->port_mutex); > } > > + /* Add every bond to the tx cache of every pmd thread, if it's not > + * there already and if this pmd has at least one rxq to poll. */ > + ovs_mutex_lock(&dp->bond_mutex); > + CMAP_FOR_EACH (pmd, node, &dp->poll_threads) { > + ovs_mutex_lock(&pmd->bond_mutex); > + if (hmap_count(&pmd->poll_list) || pmd->core_id == NON_PMD_CORE_ID) { > + struct tx_bond *bond; > + HMAP_FOR_EACH (bond, node, &dp->bonds) { > + dp_netdev_add_bond_tx_to_pmd(pmd, bond); > + } > + } > + ovs_mutex_unlock(&pmd->bond_mutex); > + } > + ovs_mutex_unlock(&dp->bond_mutex); > + > /* Reload affected pmd threads. */ > reload_affected_pmds(dp); > > @@ -5258,7 +5406,6 @@ pmd_rebalance_dry_run(struct dp_netdev *dp) > return ret; > } > > - Unrelated change. > /* Return true if needs to revalidate datapath flows. */ > static bool > dpif_netdev_run(struct dpif *dpif) > @@ -5428,6 +5575,56 @@ pmd_load_cached_ports(struct dp_netdev_pmd_thread *pmd) > } > > static void > +pmd_free_cached_bonds(struct dp_netdev_pmd_thread *pmd) > +{ > + struct tx_bond *bond, *next; > + > + /* Remove bonds from pmd which no longer exists. */ > + HMAP_FOR_EACH_SAFE (bond, next, node, &pmd->bond_cache) { > + struct tx_bond *tx = NULL; > + > + tx = tx_bond_lookup(&pmd->tx_bonds, bond->bond_id); This lookup requires holding pmd->bond_mutex. > + if (!tx) { > + /* Bond no longer exist. Delete it from pmd. */ > + hmap_remove(&pmd->bond_cache, &bond->node); > + free(bond); > + } > + } > +} > + > +/* Copies bonds from 'pmd->tx_bonds' (shared with the main thread) to > + * 'pmd->bond_cache' (thread local) */ > +static void > +pmd_load_cached_bonds(struct dp_netdev_pmd_thread *pmd) > + OVS_REQUIRES(pmd->bond_mutex) > +{ > + pmd_free_cached_bonds(pmd); > + hmap_shrink(&pmd->bond_cache); > + > + struct tx_bond *tx_bond, *tx_bond_cached; > + HMAP_FOR_EACH (tx_bond, node, &pmd->tx_bonds) { > + /* Check if bond already exist on pmd. */ > + tx_bond_cached = tx_bond_lookup(&pmd->bond_cache, tx_bond->bond_id); You can't do a lookup in PMDs local cache. Only PMD thread is allowed to use it. > + > + if (!tx_bond_cached) { > + /* Create new bond entry in cache. */ > + tx_bond_cached = xmemdup(tx_bond, sizeof *tx_bond_cached); > + hmap_insert(&pmd->bond_cache, &tx_bond_cached->node, > + hash_bond_id(tx_bond_cached->bond_id)); > + } else { > + /* Update the slave-map. */ > + for (int bucket = 0; bucket <= BOND_MASK; bucket++) { > + tx_bond_cached->slave_buckets[bucket].slave_id = > + tx_bond->slave_buckets[bucket].slave_id; > + } > + } > + VLOG_DBG("Caching bond-id %d pmd %d", > + tx_bond_cached->bond_id, pmd->core_id); > + } > +} > + > + > +static void > pmd_alloc_static_tx_qid(struct dp_netdev_pmd_thread *pmd) > { > ovs_mutex_lock(&pmd->dp->tx_qid_pool_mutex); > @@ -5449,6 +5646,14 @@ pmd_free_static_tx_qid(struct dp_netdev_pmd_thread > *pmd) > ovs_mutex_unlock(&pmd->dp->tx_qid_pool_mutex); > } > > +static void > +pmd_load_bond_cache(struct dp_netdev_pmd_thread *pmd) > +{ > + ovs_mutex_lock(&pmd->bond_mutex); > + pmd_load_cached_bonds(pmd); > + ovs_mutex_unlock(&pmd->bond_mutex); > +} Can we avoid introducing of this function by moving mutex acquaring inside the 'pmd_load_cached_bonds'. Function names are too similar. It's confusing. > + > static int > pmd_load_queues_and_ports(struct dp_netdev_pmd_thread *pmd, > struct polled_queue **ppoll_list) > @@ -5476,6 +5681,8 @@ pmd_load_queues_and_ports(struct dp_netdev_pmd_thread > *pmd, > > ovs_mutex_unlock(&pmd->port_mutex); > > + pmd_load_bond_cache(pmd); > + > *ppoll_list = poll_list; > return i; > } > @@ -6038,6 +6245,7 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread > *pmd, struct dp_netdev *dp, > atomic_init(&pmd->reload, false); > ovs_mutex_init(&pmd->flow_mutex); > ovs_mutex_init(&pmd->port_mutex); > + ovs_mutex_init(&pmd->bond_mutex); > cmap_init(&pmd->flow_table); > cmap_init(&pmd->classifiers); > pmd->ctx.last_rxq = NULL; > @@ -6048,6 +6256,8 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread > *pmd, struct dp_netdev *dp, > hmap_init(&pmd->tx_ports); > hmap_init(&pmd->tnl_port_cache); > hmap_init(&pmd->send_port_cache); > + hmap_init(&pmd->tx_bonds); > + hmap_init(&pmd->bond_cache); > /* init the 'flow_cache' since there is no > * actual thread created for NON_PMD_CORE_ID. */ > if (core_id == NON_PMD_CORE_ID) { > @@ -6068,6 +6278,8 @@ dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd) > hmap_destroy(&pmd->send_port_cache); > hmap_destroy(&pmd->tnl_port_cache); > hmap_destroy(&pmd->tx_ports); > + hmap_destroy(&pmd->bond_cache); > + hmap_destroy(&pmd->tx_bonds); > hmap_destroy(&pmd->poll_list); > /* All flows (including their dpcls_rules) have been deleted already */ > CMAP_FOR_EACH (cls, node, &pmd->classifiers) { > @@ -6079,6 +6291,7 @@ dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd) > ovs_mutex_destroy(&pmd->flow_mutex); > seq_destroy(pmd->reload_seq); > ovs_mutex_destroy(&pmd->port_mutex); > + ovs_mutex_destroy(&pmd->bond_mutex); > free(pmd); > } > > @@ -6232,6 +6445,45 @@ dp_netdev_del_port_tx_from_pmd(struct > dp_netdev_pmd_thread *pmd, > free(tx); > pmd->need_reload = true; > } > + > +static void > +dp_netdev_add_bond_tx_to_pmd(struct dp_netdev_pmd_thread *pmd, > + struct tx_bond *bond) > + OVS_REQUIRES(pmd->bond_mutex) > +{ > + bool reload = false; > + struct tx_bond *tx = tx_bond_lookup(&pmd->tx_bonds, bond->bond_id); > + if (tx) { > + /* Check if mapping is changed. */ > + for (int i = 0; i <= BOND_MASK; i++) { > + if (bond->slave_buckets[i].slave_id != > + tx->slave_buckets[i].slave_id) { > + /* Mapping is modified. Reload pmd bond cache again. */ > + reload = true; > + } > + /* Copy the map always. */ > + tx->slave_buckets[i].slave_id = bond->slave_buckets[i].slave_id; > + } > + } else { > + tx = xmemdup(bond, sizeof *tx); > + hmap_insert(&pmd->tx_bonds, &tx->node, hash_bond_id(bond->bond_id)); > + reload = true; > + } > + if (reload) { > + pmd->need_reload = true; > + } > +} > + > +/* Del 'tx' from the tx bond cache of 'pmd' */ > +static void > +dp_netdev_del_bond_tx_from_pmd(struct dp_netdev_pmd_thread *pmd, > + struct tx_bond *tx) > + OVS_REQUIRES(pmd->bond_mutex) > +{ > + hmap_remove(&pmd->tx_bonds, &tx->node); > + free(tx); > + pmd->need_reload = true; > +} > > static char * > dpif_netdev_get_datapath_version(void) > @@ -7007,6 +7259,13 @@ pmd_send_port_cache_lookup(const struct > dp_netdev_pmd_thread *pmd, > return tx_port_lookup(&pmd->send_port_cache, port_no); > } > > +static struct tx_bond * > +pmd_tx_bond_cache_lookup(const struct dp_netdev_pmd_thread *pmd, > + uint32_t bond_id) > +{ > + return tx_bond_lookup(&pmd->bond_cache, bond_id); > +} > + > static int > push_tnl_action(const struct dp_netdev_pmd_thread *pmd, > const struct nlattr *attr, > @@ -7057,6 +7316,91 @@ dp_execute_userspace_action(struct > dp_netdev_pmd_thread *pmd, > } > } > > +static bool > +dp_execute_output_action(struct dp_netdev_pmd_thread *pmd, > + struct dp_packet_batch *packets_, > + bool should_steal, odp_port_t port_no) > +{ > + struct tx_port *p = pmd_send_port_cache_lookup(pmd, port_no); > + if (!OVS_LIKELY(p)) { OVS_UNLIKELY(!p) > + return false; > + } > + > + struct dp_packet_batch out; > + if (!should_steal) { > + dp_packet_batch_clone(&out, packets_); > + dp_packet_batch_reset_cutlen(packets_); > + packets_ = &out; > + } > + dp_packet_batch_apply_cutlen(packets_); > +#ifdef DPDK_NETDEV > + if (OVS_UNLIKELY(!dp_packet_batch_is_empty(&p->output_pkts) > + && packets_->packets[0]->source > + != p->output_pkts.packets[0]->source)) { Plaese, align this the same way as it was aligned before. > + /* netdev-dpdk assumes that all packets in a single > + * output batch has the same source. Flush here to > + * avoid memory access issues. */ > + dp_netdev_pmd_flush_output_on_port(pmd, p); > + } > +#endif > + if (dp_packet_batch_size(&p->output_pkts) > + + dp_packet_batch_size(packets_) > NETDEV_MAX_BURST) { > + /* Flush here to avoid overflow. */ > + dp_netdev_pmd_flush_output_on_port(pmd, p); > + } > + if (dp_packet_batch_is_empty(&p->output_pkts)) { > + pmd->n_output_batches++; > + } > + > + struct dp_packet *packet; > + DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) { > + p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] = > + pmd->ctx.last_rxq; > + dp_packet_batch_add(&p->output_pkts, packet); > + } > + return true; > +} > + > +static bool > +dp_execute_lb_output_action(struct dp_netdev_pmd_thread *pmd, > + struct dp_packet_batch *packets_, > + bool should_steal, uint32_t bond) > +{ > + struct tx_bond *p_bond = pmd_tx_bond_cache_lookup(pmd, bond); > + if (!p_bond) { OVS_UNLIKELY ? > + return false; > + } > + > + struct dp_packet_batch del_pkts; > + dp_packet_batch_init(&del_pkts); > + > + struct dp_packet *packet; > + DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) { > + /* > + * Lookup the bond-hash table using hash to get the slave. > + */ > + uint32_t hash = dp_packet_get_rss_hash(packet); > + struct slave_entry *s_entry = &p_bond->slave_buckets[hash & > BOND_MASK]; > + odp_port_t bond_member = s_entry->slave_id; > + uint32_t size = dp_packet_size(packet); > + > + struct dp_packet_batch output_pkt; > + dp_packet_batch_init_packet(&output_pkt, packet); > + if (OVS_LIKELY(dp_execute_output_action(pmd, &output_pkt, > should_steal, > + bond_member))) { > + /* Update slave stats. */ > + non_atomic_ullong_add(&s_entry->n_packets, 1); > + non_atomic_ullong_add(&s_entry->n_bytes, size); > + } else { > + dp_packet_batch_add(&del_pkts, packet); > + } > + } > + > + /* Delete packets that failed OUTPUT action */ > + dp_packet_delete_batch(&del_pkts, should_steal); We have recently introduced drop stats. No packets should be dropped now without accounting. Please, introduce new counter for that. Details: a13a0209750c ("userspace: Improved packet drop statistics.") > + return true; > +} > + > static void > dp_execute_cb(void *aux_, struct dp_packet_batch *packets_, > const struct nlattr *a, bool should_steal) > @@ -7072,43 +7416,15 @@ dp_execute_cb(void *aux_, struct dp_packet_batch > *packets_, > > switch ((enum ovs_action_attr)type) { > case OVS_ACTION_ATTR_OUTPUT: > - p = pmd_send_port_cache_lookup(pmd, nl_attr_get_odp_port(a)); > - if (OVS_LIKELY(p)) { > - struct dp_packet *packet; > - struct dp_packet_batch out; > - > - if (!should_steal) { > - dp_packet_batch_clone(&out, packets_); > - dp_packet_batch_reset_cutlen(packets_); > - packets_ = &out; > - } > - dp_packet_batch_apply_cutlen(packets_); > - > -#ifdef DPDK_NETDEV > - if (OVS_UNLIKELY(!dp_packet_batch_is_empty(&p->output_pkts) > - && packets_->packets[0]->source > - != p->output_pkts.packets[0]->source)) { > - /* XXX: netdev-dpdk assumes that all packets in a single > - * output batch has the same source. Flush here to > - * avoid memory access issues. */ > - dp_netdev_pmd_flush_output_on_port(pmd, p); > - } > -#endif > - if (dp_packet_batch_size(&p->output_pkts) > - + dp_packet_batch_size(packets_) > NETDEV_MAX_BURST) { > - /* Flush here to avoid overflow. */ > - dp_netdev_pmd_flush_output_on_port(pmd, p); > - } > - > - if (dp_packet_batch_is_empty(&p->output_pkts)) { > - pmd->n_output_batches++; > - } > + if (dp_execute_output_action(pmd, packets_, should_steal, > + nl_attr_get_odp_port(a))) { > + return; > + } > + break; > > - DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) { > - p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] = > - > pmd->ctx.last_rxq; > - dp_packet_batch_add(&p->output_pkts, packet); > - } > + case OVS_ACTION_ATTR_LB_OUTPUT: > + if (dp_execute_lb_output_action(pmd, packets_, should_steal, > + nl_attr_get_u32(a))) { > return; > } else { > COVERAGE_ADD(datapath_drop_invalid_port, This doesn't look correct. 'else' branch is for OVS_ACTION_ATTR_OUTPUT. > @@ -7666,6 +7982,109 @@ dpif_netdev_ipf_dump_done(struct dpif *dpif > OVS_UNUSED, void *ipf_dump_ctx) > > } > > +static int > +dpif_netdev_bond_add(struct dpif *dpif, uint32_t bond_id, > + odp_port_t slave_map[]) > +{ > + struct dp_netdev *dp = get_dp_netdev(dpif); > + > + ovs_mutex_lock(&dp->bond_mutex); > + /* > + * Lookup for the bond. If already exists, just update the slave-map. > + * Else create new. > + */ > + struct tx_bond *dp_bond_entry = tx_bond_lookup(&dp->bonds, bond_id); > + if (dp_bond_entry) { > + for (int bucket = 0; bucket <= BOND_MASK; bucket++) { > + dp_bond_entry->slave_buckets[bucket].slave_id = > slave_map[bucket]; > + } > + > + struct dp_netdev_pmd_thread *pmd; > + CMAP_FOR_EACH (pmd, node, &dp->poll_threads) { > + ovs_mutex_lock(&pmd->bond_mutex); > + dp_netdev_add_bond_tx_to_pmd(pmd, dp_bond_entry); > + ovs_mutex_unlock(&pmd->bond_mutex); > + } > + } else { > + struct tx_bond *dp_bond = xzalloc(sizeof *dp_bond); > + dp_bond->bond_id = bond_id; > + for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) { > + dp_bond->slave_buckets[bucket].slave_id = slave_map[bucket]; > + } > + hmap_insert(&dp->bonds, &dp_bond->node, > + hash_bond_id(dp_bond->bond_id)); > + > + /* Insert the bond map in all pmds. */ > + struct dp_netdev_pmd_thread *pmd; > + CMAP_FOR_EACH (pmd, node, &dp->poll_threads) { > + ovs_mutex_lock(&pmd->bond_mutex); > + dp_netdev_add_bond_tx_to_pmd(pmd, dp_bond); > + ovs_mutex_unlock(&pmd->bond_mutex); > + } > + } > + ovs_mutex_unlock(&dp->bond_mutex); > + return 0; > +} > + > +static int > +dpif_netdev_bond_del(struct dpif *dpif, uint32_t bond_id) > +{ > + struct dp_netdev *dp = get_dp_netdev(dpif); > + > + ovs_mutex_lock(&dp->bond_mutex); > + > + /* Find the bond and delete it if present */ > + struct tx_bond *dp_bond_entry = tx_bond_lookup(&dp->bonds, bond_id); > + if (dp_bond_entry) { > + /* Remove the bond map in all pmds. */ > + struct dp_netdev_pmd_thread *pmd; > + CMAP_FOR_EACH (pmd, node, &dp->poll_threads) { > + ovs_mutex_lock(&pmd->bond_mutex); > + dp_netdev_del_bond_tx_from_pmd(pmd, dp_bond_entry); > + ovs_mutex_unlock(&pmd->bond_mutex); > + } > + hmap_remove(&dp->bonds, &dp_bond_entry->node); > + free(dp_bond_entry); > + } > + > + ovs_mutex_unlock(&dp->bond_mutex); > + return 0; > +} > + > +static int > +dpif_netdev_bond_stats_get(struct dpif *dpif, uint32_t bond_id, > + uint64_t *n_bytes) > +{ > + struct dp_netdev *dp = get_dp_netdev(dpif); > + > + ovs_mutex_lock(&dp->bond_mutex); > + > + /* Find the bond and retrieve stats if present */ > + struct tx_bond *dp_bond_entry = tx_bond_lookup(&dp->bonds, bond_id); > + if (dp_bond_entry) { > + /* Search the bond in all PMDs */ > + struct dp_netdev_pmd_thread *pmd; > + CMAP_FOR_EACH (pmd, node, &dp->poll_threads) { > + ovs_mutex_lock(&pmd->bond_mutex); > + struct tx_bond *pmd_bond_entry > + = tx_bond_lookup(&pmd->bond_cache, bond_id); You can't access pmd->bond_cache. Only the owning PMD thread allowed to do that. > + if (pmd_bond_entry) { > + /* Read bond stats. */ > + for (int i = 0; i <= BOND_MASK; i++) { > + uint64_t pmd_n_bytes; > + atomic_read_relaxed( > + &pmd_bond_entry->slave_buckets[i].n_bytes, > + &pmd_n_bytes); > + n_bytes[i] += pmd_n_bytes; > + } > + } > + ovs_mutex_unlock(&pmd->bond_mutex); > + } > + } > + ovs_mutex_unlock(&dp->bond_mutex); > + return 0; > +} > + > const struct dpif_class dpif_netdev_class = { > "netdev", > true, /* cleanup_required */ > @@ -7739,6 +8158,9 @@ const struct dpif_class dpif_netdev_class = { > dpif_netdev_meter_set, > dpif_netdev_meter_get, > dpif_netdev_meter_del, > + dpif_netdev_bond_add, > + dpif_netdev_bond_del, > + dpif_netdev_bond_stats_get, > }; > > static void > diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c > index 3d97c78..b1555df 100644 > --- a/lib/dpif-netlink.c > +++ b/lib/dpif-netlink.c > @@ -3995,6 +3995,9 @@ const struct dpif_class dpif_netlink_class = { > dpif_netlink_meter_set, > dpif_netlink_meter_get, > dpif_netlink_meter_del, > + NULL, /* bond_add */ > + NULL, /* bond_del */ > + NULL, /* bond_stats_get */ > }; > > static int > diff --git a/lib/dpif-provider.h b/lib/dpif-provider.h > index b77317b..4415836 100644 > --- a/lib/dpif-provider.h > +++ b/lib/dpif-provider.h > @@ -616,6 +616,15 @@ struct dpif_class { > * zero. */ > int (*meter_del)(struct dpif *, ofproto_meter_id meter_id, > struct ofputil_meter_stats *, uint16_t n_bands); > + > + /* Adds a bond with 'bond_id' and the slave-map to 'dpif'. */ > + int (*bond_add)(struct dpif *dpif, uint32_t bond_id, > + odp_port_t slave_map[]); > + /* Removes bond identified by 'bond_id' from 'dpif'. */ > + int (*bond_del)(struct dpif *dpif, uint32_t bond_id); > + /* Reads bond stats from 'dpif'. */ > + int (*bond_stats_get)(struct dpif *dpif, uint32_t bond_id, > + uint64_t *n_bytes); > }; > > extern const struct dpif_class dpif_netlink_class; > diff --git a/lib/dpif.c b/lib/dpif.c > index 955eaf6..41c0886 100644 > --- a/lib/dpif.c > +++ b/lib/dpif.c > @@ -1186,6 +1186,7 @@ dpif_execute_helper_cb(void *aux_, struct > dp_packet_batch *packets_, > > case OVS_ACTION_ATTR_CT: > case OVS_ACTION_ATTR_OUTPUT: > + case OVS_ACTION_ATTR_LB_OUTPUT: > case OVS_ACTION_ATTR_TUNNEL_PUSH: > case OVS_ACTION_ATTR_TUNNEL_POP: > case OVS_ACTION_ATTR_USERSPACE: > @@ -1236,6 +1237,7 @@ dpif_execute_helper_cb(void *aux_, struct > dp_packet_batch *packets_, > struct dp_packet *clone = NULL; > uint32_t cutlen = dp_packet_get_cutlen(packet); > if (cutlen && (type == OVS_ACTION_ATTR_OUTPUT > + || type == OVS_ACTION_ATTR_LB_OUTPUT > || type == OVS_ACTION_ATTR_TUNNEL_PUSH > || type == OVS_ACTION_ATTR_TUNNEL_POP > || type == OVS_ACTION_ATTR_USERSPACE)) { > @@ -1895,6 +1897,16 @@ dpif_supports_explicit_drop_action(const struct dpif > *dpif) > return dpif_is_netdev(dpif); > } > > +bool > +dpif_supports_lb_output_action(const struct dpif *dpif) > +{ > + /* > + * Balance-tcp optimization is currently supported in netdev > + * datapath only. > + */ > + return dpif_is_netdev(dpif); > +} > + > /* Meters */ > void > dpif_meter_get_features(const struct dpif *dpif, > @@ -1992,3 +2004,40 @@ dpif_meter_del(struct dpif *dpif, ofproto_meter_id > meter_id, > } > return error; > } > + > +int > +dpif_bond_add(struct dpif *dpif, uint32_t bond_id, odp_port_t slave_map[]) > +{ > + int error = -1; > + > + if (dpif && dpif->dpif_class && dpif->dpif_class->bond_del) { > + error = dpif->dpif_class->bond_add(dpif, bond_id, slave_map); > + } > + > + return error; > +} > + > +int > +dpif_bond_del(struct dpif *dpif, uint32_t bond_id) > +{ > + int error = -1; > + > + if (dpif && dpif->dpif_class && dpif->dpif_class->bond_del) { > + error = dpif->dpif_class->bond_del(dpif, bond_id); > + } > + > + return error; > +} > + > +int > +dpif_bond_stats_get(struct dpif *dpif, uint32_t bond_id, > + uint64_t *n_bytes) > +{ > + int error = -1; > + > + if (dpif && dpif->dpif_class && dpif->dpif_class->bond_stats_get) { > + error = dpif->dpif_class->bond_stats_get(dpif, bond_id, n_bytes); > + } > + > + return error; > +} > diff --git a/lib/dpif.h b/lib/dpif.h > index a88851e..6b3216f 100644 > --- a/lib/dpif.h > +++ b/lib/dpif.h > @@ -908,6 +908,13 @@ char *dpif_get_dp_version(const struct dpif *); > bool dpif_supports_tnl_push_pop(const struct dpif *); > bool dpif_supports_explicit_drop_action(const struct dpif *); > > +bool dpif_supports_lb_output_action(const struct dpif *); > + > +int dpif_bond_add(struct dpif *dpif, uint32_t bond_id, odp_port_t > slave_map[]); > +int dpif_bond_del(struct dpif *dpif, uint32_t bond_id); > +int dpif_bond_stats_get(struct dpif *dpif, uint32_t bond_id, > + uint64_t *n_bytes); > + > /* Log functions. */ > struct vlog_module; > > diff --git a/lib/odp-execute.c b/lib/odp-execute.c > index 42d3335..6018e37 100644 > --- a/lib/odp-execute.c > +++ b/lib/odp-execute.c > @@ -793,6 +793,7 @@ requires_datapath_assistance(const struct nlattr *a) > switch (type) { > /* These only make sense in the context of a datapath. */ > case OVS_ACTION_ATTR_OUTPUT: > + case OVS_ACTION_ATTR_LB_OUTPUT: > case OVS_ACTION_ATTR_TUNNEL_PUSH: > case OVS_ACTION_ATTR_TUNNEL_POP: > case OVS_ACTION_ATTR_USERSPACE: > @@ -1068,6 +1069,7 @@ odp_execute_actions(void *dp, struct dp_packet_batch > *batch, bool steal, > return; > } > case OVS_ACTION_ATTR_OUTPUT: > + case OVS_ACTION_ATTR_LB_OUTPUT: > case OVS_ACTION_ATTR_TUNNEL_PUSH: > case OVS_ACTION_ATTR_TUNNEL_POP: > case OVS_ACTION_ATTR_USERSPACE: > diff --git a/lib/odp-util.c b/lib/odp-util.c > index d6cb9e6..3472a8c 100644 > --- a/lib/odp-util.c > +++ b/lib/odp-util.c > @@ -119,6 +119,7 @@ odp_action_len(uint16_t type) > > switch ((enum ovs_action_attr) type) { > case OVS_ACTION_ATTR_OUTPUT: return sizeof(uint32_t); > + case OVS_ACTION_ATTR_LB_OUTPUT: return sizeof(uint32_t); > case OVS_ACTION_ATTR_TRUNC: return sizeof(struct ovs_action_trunc); > case OVS_ACTION_ATTR_TUNNEL_PUSH: return ATTR_LEN_VARIABLE; > case OVS_ACTION_ATTR_TUNNEL_POP: return sizeof(uint32_t); > @@ -1122,6 +1123,9 @@ format_odp_action(struct ds *ds, const struct nlattr *a, > case OVS_ACTION_ATTR_OUTPUT: > odp_portno_name_format(portno_names, nl_attr_get_odp_port(a), ds); > break; > + case OVS_ACTION_ATTR_LB_OUTPUT: > + ds_put_format(ds, "lb_output(bond,%"PRIu32")", nl_attr_get_u32(a)); > + break; > case OVS_ACTION_ATTR_TRUNC: { > const struct ovs_action_trunc *trunc = > nl_attr_get_unspec(a, sizeof *trunc); > diff --git a/ofproto/bond.c b/ofproto/bond.c > index 405202f..d540635 100644 > --- a/ofproto/bond.c > +++ b/ofproto/bond.c > @@ -54,10 +54,6 @@ static struct ovs_rwlock rwlock = OVS_RWLOCK_INITIALIZER; > static struct hmap all_bonds__ = HMAP_INITIALIZER(&all_bonds__); > static struct hmap *const all_bonds OVS_GUARDED_BY(rwlock) = &all_bonds__; > > -/* Bit-mask for hashing a flow down to a bucket. */ > -#define BOND_MASK 0xff > -#define BOND_BUCKETS (BOND_MASK + 1) > - > /* Priority for internal rules created to handle recirculation */ > #define RECIRC_RULE_PRIORITY 20 > > @@ -126,6 +122,8 @@ struct bond { > enum lacp_status lacp_status; /* Status of LACP negotiations. */ > bool bond_revalidate; /* True if flows need revalidation. */ > uint32_t basis; /* Basis for flow hash function. */ > + bool use_bond_cache; /* Use bond cache to avoid recirculation. > + Applicable only for Balance TCP mode. */ > > /* SLB specific bonding info. */ > struct bond_entry *hash; /* An array of BOND_BUCKETS elements. */ > @@ -185,7 +183,7 @@ static struct bond_slave *choose_output_slave(const > struct bond *, > struct flow_wildcards *, > uint16_t vlan) > OVS_REQ_RDLOCK(rwlock); > -static void update_recirc_rules__(struct bond *bond); > +static void update_recirc_rules__(struct bond *bond, uint32_t > bond_recirc_id); > static bool bond_is_falling_back_to_ab(const struct bond *); > > /* Attempts to parse 's' as the name of a bond balancing mode. If > successful, > @@ -262,6 +260,7 @@ void > bond_unref(struct bond *bond) > { > struct bond_slave *slave; > + uint32_t bond_recirc_id = 0; > > if (!bond || ovs_refcount_unref_relaxed(&bond->ref_cnt) != 1) { > return; > @@ -282,12 +281,13 @@ bond_unref(struct bond *bond) > > /* Free bond resources. Remove existing post recirc rules. */ > if (bond->recirc_id) { > + bond_recirc_id = bond->recirc_id; > recirc_free_id(bond->recirc_id); > bond->recirc_id = 0; > } > free(bond->hash); > bond->hash = NULL; > - update_recirc_rules__(bond); > + update_recirc_rules__(bond, bond_recirc_id); > > hmap_destroy(&bond->pr_rule_ops); > free(bond->name); > @@ -328,13 +328,14 @@ add_pr_rule(struct bond *bond, const struct match > *match, > * lock annotation. Currently, only 'bond_unref()' calls > * this function directly. */ > static void > -update_recirc_rules__(struct bond *bond) > +update_recirc_rules__(struct bond *bond, uint32_t bond_recirc_id) > { > struct match match; > struct bond_pr_rule_op *pr_op, *next_op; > uint64_t ofpacts_stub[128 / 8]; > struct ofpbuf ofpacts; > int i; > + ofp_port_t slave_map[BOND_MASK]; > > ofpbuf_use_stub(&ofpacts, ofpacts_stub, sizeof ofpacts_stub); > > @@ -353,8 +354,18 @@ update_recirc_rules__(struct bond *bond) > > add_pr_rule(bond, &match, slave->ofp_port, > &bond->hash[i].pr_rule); > + slave_map[i] = slave->ofp_port; > + } else { > + slave_map[i] = OFP_PORT_C(-1); > } > } > + if (bond->ofproto->backer->rt_support.lb_output_action) { > + ofproto_dpif_bundle_add(bond->ofproto, bond->recirc_id, > slave_map); > + } > + } else { > + if (bond->ofproto->backer->rt_support.lb_output_action) { > + ofproto_dpif_bundle_del(bond->ofproto, bond_recirc_id); > + } > } > > HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) { > @@ -404,7 +415,7 @@ static void > update_recirc_rules(struct bond *bond) > OVS_REQ_RDLOCK(rwlock) > { > - update_recirc_rules__(bond); > + update_recirc_rules__(bond, bond->recirc_id); > } > > /* Updates 'bond''s overall configuration to 's'. > @@ -467,6 +478,10 @@ bond_reconfigure(struct bond *bond, const struct > bond_settings *s) > recirc_free_id(bond->recirc_id); > bond->recirc_id = 0; > } > + if (bond->use_bond_cache != s->use_bond_cache) { > + bond->use_bond_cache = s->use_bond_cache; > + revalidate = true; > + } > > if (bond->balance == BM_AB || !bond->hash || revalidate) { > bond_entry_reset(bond); > @@ -944,6 +959,15 @@ bond_recirculation_account(struct bond *bond) > OVS_REQ_WRLOCK(rwlock) > { > int i; > + uint64_t n_bytes[BOND_BUCKETS] = {0}; > + > + if (bond->hash && bond->recirc_id) { > + /* Retrieve bond stats from datapath. */ > + if (bond->ofproto->backer->rt_support.lb_output_action) { > + dpif_bond_stats_get(bond->ofproto->backer->dpif, > + bond->recirc_id, n_bytes); > + } > + } > > for (i=0; i<=BOND_MASK; i++) { > struct bond_entry *entry = &bond->hash[i]; > @@ -953,8 +977,12 @@ bond_recirculation_account(struct bond *bond) > struct pkt_stats stats; > long long int used OVS_UNUSED; > > - rule->ofproto->ofproto_class->rule_get_stats( > - rule, &stats, &used); > + if (!bond->ofproto->backer->rt_support.lb_output_action) { > + rule->ofproto->ofproto_class->rule_get_stats( > + rule, &stats, &used); > + } else { > + stats.n_bytes = n_bytes[i]; > + } > bond_entry_account(entry, stats.n_bytes); > } > } > @@ -1365,6 +1393,8 @@ bond_print_details(struct ds *ds, const struct bond > *bond) > may_recirc ? "yes" : "no", may_recirc ? recirc_id: -1); > > ds_put_format(ds, "bond-hash-basis: %"PRIu32"\n", bond->basis); > + ds_put_format(ds, "lb-output-action: %s\n", > + bond->use_bond_cache ? "enabled" : "disabled"); > > ds_put_format(ds, "updelay: %d ms\n", bond->updelay); > ds_put_format(ds, "downdelay: %d ms\n", bond->downdelay); > @@ -1942,3 +1972,9 @@ bond_get_changed_active_slave(const char *name, struct > eth_addr *mac, > > return false; > } > + > +bool > +bond_get_cache_mode(const struct bond *bond) > +{ > + return bond->use_bond_cache; > +} > diff --git a/ofproto/bond.h b/ofproto/bond.h > index e7c3d9b..88a4de1 100644 > --- a/ofproto/bond.h > +++ b/ofproto/bond.h > @@ -22,6 +22,10 @@ > #include "ofproto-provider.h" > #include "packets.h" > > +/* Bit-mask for hashing a flow down to a bucket. */ > +#define BOND_MASK 0xff > +#define BOND_BUCKETS (BOND_MASK + 1) > + > struct flow; > struct netdev; > struct ofpbuf; > @@ -58,6 +62,8 @@ struct bond_settings { > /* The MAC address of the interface > that was active during the last > ovs run. */ > + bool use_bond_cache; /* Use bond cache. Only applicable for > + bond mode BALANCE TCP. */ > }; > > /* Program startup. */ > @@ -122,4 +128,7 @@ void bond_rebalance(struct bond *); > */ > void bond_update_post_recirc_rules(struct bond *, uint32_t *recirc_id, > uint32_t *hash_basis); > + > +bool bond_get_cache_mode(const struct bond *); > + > #endif /* bond.h */ > diff --git a/ofproto/ofproto-dpif-ipfix.c b/ofproto/ofproto-dpif-ipfix.c > index b413768..5030978 100644 > --- a/ofproto/ofproto-dpif-ipfix.c > +++ b/ofproto/ofproto-dpif-ipfix.c > @@ -3017,6 +3017,7 @@ dpif_ipfix_read_actions(const struct flow *flow, > case OVS_ACTION_ATTR_CHECK_PKT_LEN: > case OVS_ACTION_ATTR_UNSPEC: > case OVS_ACTION_ATTR_DROP: > + case OVS_ACTION_ATTR_LB_OUTPUT: > case __OVS_ACTION_ATTR_MAX: > default: > break; > diff --git a/ofproto/ofproto-dpif-sflow.c b/ofproto/ofproto-dpif-sflow.c > index f9ea47a..21dafa6 100644 > --- a/ofproto/ofproto-dpif-sflow.c > +++ b/ofproto/ofproto-dpif-sflow.c > @@ -1177,6 +1177,7 @@ dpif_sflow_read_actions(const struct flow *flow, > case OVS_ACTION_ATTR_CT: > case OVS_ACTION_ATTR_CT_CLEAR: > case OVS_ACTION_ATTR_METER: > + case OVS_ACTION_ATTR_LB_OUTPUT: > break; > > case OVS_ACTION_ATTR_SET_MASKED: > diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c > index 4407f9c..0446387 100644 > --- a/ofproto/ofproto-dpif-xlate.c > +++ b/ofproto/ofproto-dpif-xlate.c > @@ -409,6 +409,8 @@ struct xlate_ctx { > struct ofpbuf action_set; /* Action set. */ > > enum xlate_error error; /* Translation failed. */ > + > + bool tnl_push_no_recirc; /* Tunnel push recirculation status */ > }; > > /* Structure to track VLAN manipulation */ > @@ -3727,12 +3729,16 @@ native_tunnel_output(struct xlate_ctx *ctx, const > struct xport *xport, > ctx->xin->allow_side_effects = backup_side_effects; > ctx->xin->packet = backup_packet; > ctx->wc = backup_wc; > + > + ctx->tnl_push_no_recirc = true; > } else { > /* In order to maintain accurate stats, use recirc for > * natvie tunneling. */ > nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC, 0); > nl_msg_end_nested(ctx->odp_actions, clone_ofs); > - } > + > + ctx->tnl_push_no_recirc = false; > + } > > /* Restore the flows after the translation. */ > memcpy(&ctx->xin->flow, &old_flow, sizeof ctx->xin->flow); > @@ -4183,7 +4189,6 @@ compose_output_action__(struct xlate_ctx *ctx, > ofp_port_t ofp_port, > xlate_commit_actions(ctx); > > if (xr) { > - /* Recirculate the packet. */ > struct ovs_action_hash *act_hash; > > /* Hash action. */ > @@ -4192,15 +4197,30 @@ compose_output_action__(struct xlate_ctx *ctx, > ofp_port_t ofp_port, > /* Algorithm supported by all datapaths. */ > hash_alg = OVS_HASH_ALG_L4; > } > - act_hash = nl_msg_put_unspec_uninit(ctx->odp_actions, > + > + if ((xr->hash_alg == OVS_HASH_ALG_L4) && > + (ctx->xbridge->support.lb_output_action) && > + !ctx->tnl_push_no_recirc && > + bond_get_cache_mode(xport->xbundle->bond)) { > + /* > + * If bond mode is balance-tcp and optimize balance tcp > + * is enabled then use the hash directly for slave selection > + * and avoid recirculation. > + * > + * Currently support for netdev datapath only. > + */ > + nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_LB_OUTPUT, > + xr->recirc_id); > + } else { > + act_hash = nl_msg_put_unspec_uninit(ctx->odp_actions, > OVS_ACTION_ATTR_HASH, > sizeof *act_hash); > - act_hash->hash_alg = hash_alg; > - act_hash->hash_basis = xr->hash_basis; > - > - /* Recirc action. */ > - nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC, > - xr->recirc_id); > + act_hash->hash_alg = hash_alg; > + act_hash->hash_basis = xr->hash_basis; > + /* Recirculate the packet. */ > + nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC, > + xr->recirc_id); > + } > } else if (is_native_tunnel) { > /* Output to native tunnel port. */ > native_tunnel_output(ctx, xport, flow, odp_port, truncate); > @@ -7261,7 +7281,8 @@ count_output_actions(const struct ofpbuf *odp_actions) > int n = 0; > > NL_ATTR_FOR_EACH_UNSAFE (a, left, odp_actions->data, odp_actions->size) { > - if (a->nla_type == OVS_ACTION_ATTR_OUTPUT) { > + if ((a->nla_type == OVS_ACTION_ATTR_OUTPUT) || > + (a->nla_type == OVS_ACTION_ATTR_LB_OUTPUT)) { > n++; > } > } > @@ -7458,6 +7479,7 @@ xlate_actions(struct xlate_in *xin, struct xlate_out > *xout) > > .action_set_has_group = false, > .action_set = OFPBUF_STUB_INITIALIZER(action_set_stub), > + .tnl_push_no_recirc = false, > }; > > /* 'base_flow' reflects the packet as it came in, but we need it to > reflect > diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c > index f782f86..3c5438b 100644 > --- a/ofproto/ofproto-dpif.c > +++ b/ofproto/ofproto-dpif.c > @@ -1580,6 +1580,8 @@ check_support(struct dpif_backer *backer) > backer->rt_support.ct_timeout = check_ct_timeout_policy(backer); > backer->rt_support.explicit_drop_action = > dpif_supports_explicit_drop_action(backer->dpif); > + backer->rt_support.lb_output_action= > + dpif_supports_lb_output_action(backer->dpif); > > /* Flow fields. */ > backer->rt_support.odp.ct_state = check_ct_state(backer); > @@ -3433,6 +3435,27 @@ bundle_remove(struct ofport *port_) > } > } > > +int > +ofproto_dpif_bundle_add(struct ofproto_dpif *ofproto, > + uint32_t bond_id, const ofp_port_t slave_map[]) > +{ > + /* Convert ofp_port to odp_port */ > + odp_port_t odp_map[BOND_BUCKETS]; > + for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) { > + odp_map[bucket] = (slave_map[bucket] == OFP_PORT_C(-1) > + ? ODP_PORT_C(-1) > + : ofp_port_to_odp_port(ofproto, > slave_map[bucket])); > + } > + > + return dpif_bond_add(ofproto->backer->dpif, bond_id, odp_map); > +} > + > +int > +ofproto_dpif_bundle_del(struct ofproto_dpif *ofproto, uint32_t bond_id) > +{ > + return dpif_bond_del(ofproto->backer->dpif, bond_id); > +} > + > static void > send_pdu_cb(void *port_, const void *pdu, size_t pdu_size) > { > @@ -5563,6 +5586,7 @@ get_datapath_cap(const char *datapath_type, struct smap > *cap) > smap_add(cap, "ct_timeout", s.ct_timeout ? "true" : "false"); > smap_add(cap, "explicit_drop_action", > s.explicit_drop_action ? "true" :"false"); > + smap_add(cap, "lb_output_action", s.lb_output_action ? "true" : "false"); > } > > /* Gets timeout policy name in 'backer' based on 'zone', 'dl_type' and > diff --git a/ofproto/ofproto-dpif.h b/ofproto/ofproto-dpif.h > index c9d5df3..384a269 100644 > --- a/ofproto/ofproto-dpif.h > +++ b/ofproto/ofproto-dpif.h > @@ -202,8 +202,10 @@ struct group_dpif *group_dpif_lookup(struct ofproto_dpif > *, > DPIF_SUPPORT_FIELD(bool, ct_timeout, "Conntrack timeout policy") \ > \ > /* True if the datapath supports explicit drop action. */ \ > - DPIF_SUPPORT_FIELD(bool, explicit_drop_action, "Explicit Drop action") > - > + DPIF_SUPPORT_FIELD(bool, explicit_drop_action, "Explicit Drop action") \ > + \ > + /* True if the datapath supports balance_tcp optimization */ \ > + DPIF_SUPPORT_FIELD(bool, lb_output_action, "Optimized Balance TCP mode") > > /* Stores the various features which the corresponding backer supports. */ > struct dpif_backer_support { > @@ -379,6 +381,9 @@ int ofproto_dpif_add_internal_flow(struct ofproto_dpif *, > struct rule **rulep); > int ofproto_dpif_delete_internal_flow(struct ofproto_dpif *, struct match *, > int priority); > +int ofproto_dpif_bundle_add(struct ofproto_dpif *, uint32_t bond_id, > + const ofp_port_t slave_map[]); > +int ofproto_dpif_bundle_del(struct ofproto_dpif *, uint32_t bond_id); > > bool ovs_native_tunneling_is_on(struct ofproto_dpif *); > > diff --git a/tests/lacp.at b/tests/lacp.at > index 7b460d7..c881a0b 100644 > --- a/tests/lacp.at > +++ b/tests/lacp.at > @@ -121,6 +121,7 @@ AT_CHECK([ovs-appctl bond/show], [0], [dnl > bond_mode: active-backup > bond may use recirculation: no, Recirc-ID : -1 > bond-hash-basis: 0 > +lb-output-action: disabled > updelay: 0 ms > downdelay: 0 ms > lacp_status: negotiated > @@ -286,6 +287,7 @@ slave: p3: current attached > bond_mode: balance-tcp > bond may use recirculation: yes, <del> > bond-hash-basis: 0 > +lb-output-action: disabled > updelay: 0 ms > downdelay: 0 ms > lacp_status: negotiated > @@ -301,6 +303,7 @@ slave p1: enabled > bond_mode: balance-tcp > bond may use recirculation: yes, <del> > bond-hash-basis: 0 > +lb-output-action: disabled > updelay: 0 ms > downdelay: 0 ms > lacp_status: negotiated > @@ -423,6 +426,7 @@ slave: p3: current attached > bond_mode: balance-tcp > bond may use recirculation: yes, <del> > bond-hash-basis: 0 > +lb-output-action: disabled > updelay: 0 ms > downdelay: 0 ms > lacp_status: negotiated > @@ -440,6 +444,7 @@ slave p1: enabled > bond_mode: balance-tcp > bond may use recirculation: yes, <del> > bond-hash-basis: 0 > +lb-output-action: disabled > updelay: 0 ms > downdelay: 0 ms > lacp_status: negotiated > @@ -555,6 +560,7 @@ slave: p3: current attached > bond_mode: balance-tcp > bond may use recirculation: yes, <del> > bond-hash-basis: 0 > +lb-output-action: disabled > updelay: 0 ms > downdelay: 0 ms > lacp_status: negotiated > @@ -572,6 +578,7 @@ slave p1: enabled > bond_mode: balance-tcp > bond may use recirculation: yes, <del> > bond-hash-basis: 0 > +lb-output-action: disabled > updelay: 0 ms > downdelay: 0 ms > lacp_status: negotiated > @@ -692,6 +699,7 @@ slave: p3: current attached > bond_mode: balance-tcp > bond may use recirculation: yes, <del> > bond-hash-basis: 0 > +lb-output-action: disabled > updelay: 0 ms > downdelay: 0 ms > lacp_status: negotiated > @@ -709,6 +717,7 @@ slave p1: enabled > bond_mode: balance-tcp > bond may use recirculation: yes, <del> > bond-hash-basis: 0 > +lb-output-action: disabled > updelay: 0 ms > downdelay: 0 ms > lacp_status: negotiated > diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c > index 86c7b10..d4f34ad 100644 > --- a/vswitchd/bridge.c > +++ b/vswitchd/bridge.c > @@ -4589,6 +4589,12 @@ port_configure_bond(struct port *port, struct > bond_settings *s) > /* OVSDB did not store the last active interface */ > s->active_slave_mac = eth_addr_zero; > } > + /* Default set bond cache use to false. */ > + s->use_bond_cache = false; > + if (s->balance == BM_TCP) { > + s->use_bond_cache = smap_get_bool(&port->cfg->other_config, > + "lb-output-action", false); > + } > } > > /* Returns true if 'port' is synthetic, that is, if we constructed it locally > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml > index 0ec726c..bdc7ef4 100644 > --- a/vswitchd/vswitch.xml > +++ b/vswitchd/vswitch.xml > @@ -1962,6 +1962,16 @@ > <code>active-backup</code>. > </column> > > + <column name="other_config" key="lb-output-action" > + type='{"type": "boolean"}'> > + Enable/disable usage of optimized lb-output-action for > + balancing flows among output slaves in load balanced bonds in > + <code>balance-tcp</code>. When enabled, it uses optimized path for > + balance-tcp mode by using rss hash and avoids recirculation. > + It affects only new flows, i.e, existing flows remain unchanged. > + This knob does not affect other balancing modes. > + </column> > + > <group title="Link Failure Detection"> > <p> > An important part of link bonding is detecting that links are down > so > @@ -5722,6 +5732,19 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 > type=patch options:peer=p1 \ > higher performance for MPLS and active-active load balancing > bonding modes. > </column> > + <column name="capabilities" key="lb_output_action" > + type='{"type": "boolean"}'> > + If this is true, then the datapath supports optimized balance-tcp > + bond mode. This capability replaces existing <code>hash</code> and > + <code>recirc</code> actions with new action <code>lb_output</code> > + and avoids recirculation of packet in datapath. It is supported > + only for balance-tcp bond mode in netdev datapath. The new action > + gives higer performance but also requires pmd reloads whenever > + traffic load needs to be rebalanced among bonded links. By default > + this new action is disabled, however it can be enabled by setting > + <ref column="other-config" key="lb-output-action"/> in > + <ref table="Port"/> table. > + </column> > <group title="Connection-Tracking Capabilities"> > <p> > These capabilities are granular because Open vSwitch and its > _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev