On 08.01.2020 06:18, Vishal Deep Ajmera wrote:
> Problem:
> --------
> In OVS-DPDK, flows with output over a bond interface of type “balance-tcp”
> (using a hash on TCP/UDP 5-tuple) get translated by the ofproto layer into
> "HASH" and "RECIRC" datapath actions. After recirculation, the packet is
> forwarded to the bond member port based on 8-bits of the datapath hash
> value computed through dp_hash. This causes performance degradation in the
> following ways:
>
> 1. The recirculation of the packet implies another lookup of the packet’s
> flow key in the exact match cache (EMC) and potentially Megaflow classifier
> (DPCLS). This is the biggest cost factor.
>
> 2. The recirculated packets have a new “RSS” hash and compete with the
> original packets for the scarce number of EMC slots. This implies more
> EMC misses and potentially EMC thrashing causing costly DPCLS lookups.
>
> 3. The 256 extra megaflow entries per bond for dp_hash bond selection put
> additional load on the revalidation threads.
>
> Owing to this performance degradation, deployments stick to “balance-slb”
> bond mode even though it does not do active-active load balancing for
> VXLAN- and GRE-tunnelled traffic because all tunnel packet have the same
> source MAC address.
>
> Proposed optimization:
> ----------------------
> This proposal introduces a new load-balancing output action instead of
> recirculation.
>
> Maintain one table per-bond (could just be an array of uint16's) and
> program it the same way internal flows are created today for each possible
> hash value(256 entries) from ofproto layer. Use this table to load-balance
> flows as part of output action processing.
>
> Currently xlate_normal() -> output_normal() -> bond_update_post_recirc_rules()
> -> bond_may_recirc() and compose_output_action__() generate
> “dp_hash(hash_l4(0))” and “recirc(<RecircID>)” actions. In this case the
> RecircID identifies the bond. For the recirculated packets the ofproto layer
> installs megaflow entries that match on RecircID and masked dp_hash and send
> them to the corresponding output port.
>
> Instead, we will now generate action as
> "lb_output(bond,<bond id>)"
>
> This combines hash computation (only if needed, else re-use RSS hash) and
> inline load-balancing over the bond. This action is used *only* for
> balance-tcp
> bonds in OVS-DPDK datapath (the OVS kernel datapath remains unchanged).
>
> Example:
> --------
> Current scheme:
> ---------------
> With 1 IP-UDP flow:
>
> flow-dump from pmd on cpu core: 2
> recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no),
> packets:2828969, bytes:181054016, used:0.000s,
> actions:hash(hash_l4(0)),recirc(0x1)
>
> recirc_id(0x1),dp_hash(0x113683bd/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
> packets:2828937, bytes:181051968, used:0.000s, actions:2
>
> With 8 IP-UDP flows (with random UDP src port): (all hitting same DPCL):
>
> flow-dump from pmd on cpu core: 2
> recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no),
> packets:2674009, bytes:171136576, used:0.000s,
> actions:hash(hash_l4(0)),recirc(0x1)
>
> recirc_id(0x1),dp_hash(0xf8e02b7e/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
> packets:377395, bytes:24153280, used:0.000s, actions:2
> recirc_id(0x1),dp_hash(0xb236c260/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
> packets:333486, bytes:21343104, used:0.000s, actions:1
> recirc_id(0x1),dp_hash(0x7d89eb18/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
> packets:348461, bytes:22301504, used:0.000s, actions:1
> recirc_id(0x1),dp_hash(0xa78d75df/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
> packets:633353, bytes:40534592, used:0.000s, actions:2
> recirc_id(0x1),dp_hash(0xb58d846f/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
> packets:319901, bytes:20473664, used:0.001s, actions:2
> recirc_id(0x1),dp_hash(0x24534406/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
> packets:334985, bytes:21439040, used:0.001s, actions:1
> recirc_id(0x1),dp_hash(0x3cf32550/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
> packets:326404, bytes:20889856, used:0.001s, actions:1
>
> New scheme:
> -----------
> We can do with a single flow entry (for any number of new flows):
>
> in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no),
> packets:2674009, bytes:171136576, used:0.000s, actions:lb_output(bond,1)
>
> A new CLI has been added to dump datapath bond cache as given below.
>
> “sudo ovs-appctl dpif-netdev/dp-bond-show [dp]”
>
> root@ubuntu-190:performance_scripts # sudo ovs-appctl dpif-netdev/dp-bond-show
> Bond cache:
> bond-id 1 :
> bucket 0 - slave 2
> bucket 1 - slave 1
> bucket 2 - slave 2
> bucket 3 - slave 1
>
> Performance improvement:
> ------------------------
> With a prototype of the proposed idea, the following perf improvement is seen
> with Phy-VM-Phy UDP traffic, single flow. With multiple flows, the improvement
> is even more enhanced (due to reduced number of flows).
>
> 1 VM:
> *****
> +--------------------------------------+
> | mpps |
> +--------------------------------------+
> | Flows master with-opt. %delta |
> +--------------------------------------+
> | 1 4.47 5.73 28.21
> | 10 4.17 5.35 28.45
> | 400 3.52 5.39 53.01
> | 1k 3.41 5.25 53.78
> | 10k 2.53 4.57 80.44
> | 100k 2.32 4.27 83.59
> | 500k 2.32 4.27 83.59
> +--------------------------------------+
> mpps: million packets per second.
> packet size 64 bytes.
>
> Signed-off-by: Manohar Krishnappa Chidambaraswamy <[email protected]>
> Co-authored-by: Manohar Krishnappa Chidambaraswamy <[email protected]>
> Signed-off-by: Vishal Deep Ajmera <[email protected]>
>
> CC: Jan Scheurich <[email protected]>
> CC: Venkatesan Pradeep <[email protected]>
> CC: Nitin Katiyar <[email protected]>
> ---
> datapath/linux/compat/include/linux/openvswitch.h | 1 +
> lib/dpif-netdev.c | 502
> ++++++++++++++++++++--
> lib/dpif-netlink.c | 3 +
> lib/dpif-provider.h | 9 +
> lib/dpif.c | 49 +++
> lib/dpif.h | 7 +
> lib/odp-execute.c | 2 +
> lib/odp-util.c | 4 +
> ofproto/bond.c | 56 ++-
> ofproto/bond.h | 9 +
> ofproto/ofproto-dpif-ipfix.c | 1 +
> ofproto/ofproto-dpif-sflow.c | 1 +
> ofproto/ofproto-dpif-xlate.c | 42 +-
> ofproto/ofproto-dpif.c | 24 ++
> ofproto/ofproto-dpif.h | 9 +-
> tests/lacp.at | 9 +
> vswitchd/bridge.c | 6 +
> vswitchd/vswitch.xml | 23 +
> 18 files changed, 695 insertions(+), 62 deletions(-)
I started reviewing, but stopped at some point because I realized
that current solution is over-complicated (too much of unnecessary
memory copies) and not thread safe.
See some comments inline.
I'll follow up with suggestions on how to implement this functionality
in a different way, since I don't think that it's possible to make
current version work correctly.
NACK for dpif-netdev part.
Best regards, Ilya Maximets.
>
> diff --git a/datapath/linux/compat/include/linux/openvswitch.h
> b/datapath/linux/compat/include/linux/openvswitch.h
> index 2f0c655..7604a2a 100644
> --- a/datapath/linux/compat/include/linux/openvswitch.h
> +++ b/datapath/linux/compat/include/linux/openvswitch.h
> @@ -1021,6 +1021,7 @@ enum ovs_action_attr {
> OVS_ACTION_ATTR_TUNNEL_PUSH, /* struct ovs_action_push_tnl*/
> OVS_ACTION_ATTR_TUNNEL_POP, /* u32 port number. */
> OVS_ACTION_ATTR_DROP, /* u32 xlate_error. */
> + OVS_ACTION_ATTR_LB_OUTPUT, /* u32 bond-id */
Please, add the period at the end of comment.
> #endif
> __OVS_ACTION_ATTR_MAX, /* Nothing past this will be accepted
> * from userspace. */
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 2421821..78f9cf9 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -79,6 +79,7 @@
> #include "unixctl.h"
> #include "util.h"
> #include "uuid.h"
> +#include "ofproto/bond.h"
This doesn't look good. Please, avoid including 'ofproto' headers
from the 'lib' code.
>
> VLOG_DEFINE_THIS_MODULE(dpif_netdev);
>
> @@ -377,6 +378,11 @@ struct dp_netdev {
>
> struct conntrack *conntrack;
> struct pmd_auto_lb pmd_alb;
As Ben suggested, empty line here.
> + /* Bonds.
> + *
> + * Any lookup into 'bonds' requires taking 'bond_mutex'. */
> + struct ovs_mutex bond_mutex;
> + struct hmap bonds;
As Ben suggested, comment here is required.
> };
>
> static void meter_lock(const struct dp_netdev *dp, uint32_t meter_id)
> @@ -607,6 +613,20 @@ struct tx_port {
> struct dp_netdev_rxq *output_pkts_rxqs[NETDEV_MAX_BURST];
> };
>
> +/* Contained by struct tx_bond 'slave_buckets' */
> +struct slave_entry {
> + odp_port_t slave_id;
> + atomic_ullong n_packets;
> + atomic_ullong n_bytes;
> +};
> +
> +/* Contained by struct dp_netdev_pmd_thread's 'bond_cache' or 'tx_bonds'. */
> +struct tx_bond {
> + struct hmap_node node;
> + uint32_t bond_id;
> + struct slave_entry slave_buckets[BOND_BUCKETS];
> +};
> +
> /* A set of properties for the current processing loop that is not directly
> * associated with the pmd thread itself, but with the packets being
> * processed or the short-term system configuration (for example, time).
> @@ -718,7 +738,6 @@ struct dp_netdev_pmd_thread {
> atomic_bool wait_for_reload; /* Can we busy wait for the next reload?
> */
> atomic_bool reload_tx_qid; /* Do we need to reload static_tx_qid? */
> atomic_bool exit; /* For terminating the pmd thread. */
> -
Unrelated change. Please, keep the line.
> pthread_t thread;
> unsigned core_id; /* CPU core id of this pmd thread. */
> int numa_id; /* numa node id of this pmd thread. */
> @@ -739,6 +758,11 @@ struct dp_netdev_pmd_thread {
> * read by the pmd thread. */
> struct hmap tx_ports OVS_GUARDED;
>
> + struct ovs_mutex bond_mutex; /* Mutex for 'tx_bonds'. */
> + /* Map of 'tx_bond's used for transmission. Written by the main thread,
> + * read/written by the pmd thread. */
Why PMD needs to write into this hash map? Can we avoid this?
Sounds dangerous.
> + struct hmap tx_bonds OVS_GUARDED;
> +
> /* These are thread-local copies of 'tx_ports'. One contains only tunnel
> * ports (that support push_tunnel/pop_tunnel), the other contains ports
> * with at least one txq (that support send). A port can be in both.
> @@ -751,6 +775,8 @@ struct dp_netdev_pmd_thread {
> * other instance will only be accessed by its own pmd thread. */
> struct hmap tnl_port_cache;
> struct hmap send_port_cache;
> + /* These are thread-local copies of 'tx_bonds' */
> + struct hmap bond_cache;
>
> /* Keep track of detailed PMD performance statistics. */
> struct pmd_perf_stats perf_stats;
> @@ -830,6 +856,12 @@ static void dp_netdev_del_rxq_from_pmd(struct
> dp_netdev_pmd_thread *pmd,
> static int
> dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *pmd,
> bool force);
> +static void dp_netdev_add_bond_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
> + struct tx_bond *bond)
> + OVS_REQUIRES(pmd->bond_mutex);
> +static void dp_netdev_del_bond_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
> + struct tx_bond *tx)
> + OVS_REQUIRES(pmd->bond_mutex);
>
> static void reconfigure_datapath(struct dp_netdev *dp)
> OVS_REQUIRES(dp->port_mutex);
> @@ -838,6 +870,10 @@ static void dp_netdev_pmd_unref(struct
> dp_netdev_pmd_thread *pmd);
> static void dp_netdev_pmd_flow_flush(struct dp_netdev_pmd_thread *pmd);
> static void pmd_load_cached_ports(struct dp_netdev_pmd_thread *pmd)
> OVS_REQUIRES(pmd->port_mutex);
> +static void pmd_load_cached_bonds(struct dp_netdev_pmd_thread *pmd)
> + OVS_REQUIRES(pmd->bond_mutex);
> +static void pmd_load_bond_cache(struct dp_netdev_pmd_thread *pmd);
> +
> static inline void
> dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
> struct polled_queue *poll_list, int poll_cnt);
> @@ -1396,6 +1432,45 @@ pmd_perf_show_cmd(struct unixctl_conn *conn, int argc,
> par.command_type = PMD_INFO_PERF_SHOW;
> dpif_netdev_pmd_info(conn, argc, argv, &par);
> }
> +
> +static void
> +dpif_netdev_dp_bond_show(struct unixctl_conn *conn, int argc,
> + const char *argv[], void *aux OVS_UNUSED)
> +{
> + struct ds reply = DS_EMPTY_INITIALIZER;
> +
> + ovs_mutex_lock(&dp_netdev_mutex);
> +
> + struct dp_netdev *dp = NULL;
> + if (argc == 2) {
> + dp = shash_find_data(&dp_netdevs, argv[1]);
> + } else if (shash_count(&dp_netdevs) == 1) {
> + /* There's only one datapath */
> + dp = shash_first(&dp_netdevs)->data;
> + }
> + if (!dp) {
> + ovs_mutex_unlock(&dp_netdev_mutex);
> + unixctl_command_reply_error(conn,
> + "please specify an existing datapath");
> + return;
> + }
> + ds_put_cstr(&reply, "\nBonds:\n");
> +
> + struct tx_bond *dp_bond_entry;
> + HMAP_FOR_EACH (dp_bond_entry, node, &dp->bonds) {
This requires holding dp->bond_mutex.
> + ds_put_format(&reply, "\tbond-id %u :\n",
> + dp_bond_entry->bond_id);
> + for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) {
> + ds_put_format(&reply, "\t\tbucket %u - slave %u \n",
> + bucket,
> + dp_bond_entry->slave_buckets[bucket].slave_id);
> + }
> + }
> + ovs_mutex_unlock(&dp_netdev_mutex);
> + unixctl_command_reply(conn, ds_cstr(&reply));
> + ds_destroy(&reply);
> +}
> +
>
> static int
> dpif_netdev_init(void)
> @@ -1427,6 +1502,9 @@ dpif_netdev_init(void)
> "[-us usec] [-q qlen]",
> 0, 10, pmd_perf_log_set_cmd,
> NULL);
> + unixctl_command_register("dpif-netdev/dp-bond-show", "[dp]",
> + 0, 1, dpif_netdev_dp_bond_show,
> + NULL);
> return 0;
> }
>
> @@ -1551,6 +1629,9 @@ create_dp_netdev(const char *name, const struct
> dpif_class *class,
> ovs_mutex_init(&dp->port_mutex);
> hmap_init(&dp->ports);
> dp->port_seq = seq_create();
> + ovs_mutex_init(&dp->bond_mutex);
> + hmap_init(&dp->bonds);
> +
> fat_rwlock_init(&dp->upcall_rwlock);
>
> dp->reconfigure_seq = seq_create();
> @@ -1674,6 +1755,14 @@ dp_netdev_free(struct dp_netdev *dp)
> }
> ovs_mutex_unlock(&dp->port_mutex);
>
> + ovs_mutex_lock(&dp->bond_mutex);
> + struct tx_bond *bond, *next_bond;
> + HMAP_FOR_EACH_SAFE (bond, next_bond, node, &dp->bonds) {
> + hmap_remove(&dp->bonds, &bond->node);
> + free(bond);
> + }
> + ovs_mutex_unlock(&dp->bond_mutex);
> +
> dp_netdev_destroy_all_pmds(dp, true);
> cmap_destroy(&dp->poll_threads);
>
> @@ -1692,6 +1781,9 @@ dp_netdev_free(struct dp_netdev *dp)
> hmap_destroy(&dp->ports);
> ovs_mutex_destroy(&dp->port_mutex);
>
> + hmap_destroy(&dp->bonds);
> + ovs_mutex_destroy(&dp->bond_mutex);
> +
> /* Upcalls must be disabled at this point */
> dp_netdev_destroy_upcall_lock(dp);
>
> @@ -1795,6 +1887,7 @@ dp_netdev_reload_pmd__(struct dp_netdev_pmd_thread *pmd)
> ovs_mutex_lock(&pmd->port_mutex);
> pmd_load_cached_ports(pmd);
> ovs_mutex_unlock(&pmd->port_mutex);
> + pmd_load_bond_cache(pmd);
> ovs_mutex_unlock(&pmd->dp->non_pmd_mutex);
> return;
> }
> @@ -1809,6 +1902,12 @@ hash_port_no(odp_port_t port_no)
> return hash_int(odp_to_u32(port_no), 0);
> }
>
> +static uint32_t
> +hash_bond_id(uint32_t bond_id)
> +{
> + return hash_int(bond_id, 0);
> +}
> +
> static int
> port_create(const char *devname, const char *type,
> odp_port_t port_no, struct dp_netdev_port **portp)
> @@ -4349,6 +4448,19 @@ tx_port_lookup(const struct hmap *hmap, odp_port_t
> port_no)
> return NULL;
> }
>
> +static struct tx_bond *
> +tx_bond_lookup(const struct hmap *hmap, uint32_t bond_id)
> +{
> + struct tx_bond *tx;
> +
> + HMAP_FOR_EACH_IN_BUCKET (tx, node, hash_bond_id(bond_id), hmap) {
Why not HMAP_FOR_EACH_WITH_HASH ?
BTW, it's better to calculate hash before the loop to be sure that
we're not recalculating it on each iteration.
> + if (tx->bond_id == bond_id) {
> + return tx;
> + }
> + }
> + return NULL;
> +}
> +
> static int
> port_reconfigure(struct dp_netdev_port *port)
> {
> @@ -4837,6 +4949,26 @@ pmd_remove_stale_ports(struct dp_netdev *dp,
> ovs_mutex_unlock(&pmd->port_mutex);
> }
>
> +static void
> +pmd_remove_stale_bonds(struct dp_netdev *dp,
> + struct dp_netdev_pmd_thread *pmd)
> + OVS_EXCLUDED(pmd->bond_mutex)
> + OVS_EXCLUDED(dp->bond_mutex)
> +{
> + ovs_mutex_lock(&dp->bond_mutex);
> + ovs_mutex_lock(&pmd->bond_mutex);
> +
> + struct tx_bond *tx, *tx_next;
> + HMAP_FOR_EACH_SAFE (tx, tx_next, node, &pmd->tx_bonds) {
> + if (!tx_bond_lookup(&dp->bonds, tx->bond_id)) {
> + dp_netdev_del_bond_tx_from_pmd(pmd, tx);
> + }
> + }
> +
> + ovs_mutex_unlock(&pmd->bond_mutex);
> + ovs_mutex_unlock(&dp->bond_mutex);
> +}
> +
> /* Must be called each time a port is added/removed or the cmask changes.
> * This creates and destroys pmd threads, reconfigures ports, opens their
> * rxqs and assigns all rxqs/txqs to pmd threads. */
> @@ -4875,10 +5007,11 @@ reconfigure_datapath(struct dp_netdev *dp)
> }
> }
>
> - /* Remove from the pmd threads all the ports that have been deleted or
> - * need reconfiguration. */
> + /* Remove from the pmd threads all the ports/bonds that have been deleted
> + * or need reconfiguration. */
> CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> pmd_remove_stale_ports(dp, pmd);
> + pmd_remove_stale_bonds(dp, pmd);
> }
>
> /* Reload affected pmd threads. We must wait for the pmd threads before
> @@ -5000,6 +5133,21 @@ reconfigure_datapath(struct dp_netdev *dp)
> ovs_mutex_unlock(&pmd->port_mutex);
> }
>
> + /* Add every bond to the tx cache of every pmd thread, if it's not
> + * there already and if this pmd has at least one rxq to poll. */
> + ovs_mutex_lock(&dp->bond_mutex);
> + CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> + ovs_mutex_lock(&pmd->bond_mutex);
> + if (hmap_count(&pmd->poll_list) || pmd->core_id == NON_PMD_CORE_ID) {
> + struct tx_bond *bond;
> + HMAP_FOR_EACH (bond, node, &dp->bonds) {
> + dp_netdev_add_bond_tx_to_pmd(pmd, bond);
> + }
> + }
> + ovs_mutex_unlock(&pmd->bond_mutex);
> + }
> + ovs_mutex_unlock(&dp->bond_mutex);
> +
> /* Reload affected pmd threads. */
> reload_affected_pmds(dp);
>
> @@ -5258,7 +5406,6 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)
> return ret;
> }
>
> -
Unrelated change.
> /* Return true if needs to revalidate datapath flows. */
> static bool
> dpif_netdev_run(struct dpif *dpif)
> @@ -5428,6 +5575,56 @@ pmd_load_cached_ports(struct dp_netdev_pmd_thread *pmd)
> }
>
> static void
> +pmd_free_cached_bonds(struct dp_netdev_pmd_thread *pmd)
> +{
> + struct tx_bond *bond, *next;
> +
> + /* Remove bonds from pmd which no longer exists. */
> + HMAP_FOR_EACH_SAFE (bond, next, node, &pmd->bond_cache) {
> + struct tx_bond *tx = NULL;
> +
> + tx = tx_bond_lookup(&pmd->tx_bonds, bond->bond_id);
This lookup requires holding pmd->bond_mutex.
> + if (!tx) {
> + /* Bond no longer exist. Delete it from pmd. */
> + hmap_remove(&pmd->bond_cache, &bond->node);
> + free(bond);
> + }
> + }
> +}
> +
> +/* Copies bonds from 'pmd->tx_bonds' (shared with the main thread) to
> + * 'pmd->bond_cache' (thread local) */
> +static void
> +pmd_load_cached_bonds(struct dp_netdev_pmd_thread *pmd)
> + OVS_REQUIRES(pmd->bond_mutex)
> +{
> + pmd_free_cached_bonds(pmd);
> + hmap_shrink(&pmd->bond_cache);
> +
> + struct tx_bond *tx_bond, *tx_bond_cached;
> + HMAP_FOR_EACH (tx_bond, node, &pmd->tx_bonds) {
> + /* Check if bond already exist on pmd. */
> + tx_bond_cached = tx_bond_lookup(&pmd->bond_cache, tx_bond->bond_id);
You can't do a lookup in PMDs local cache. Only PMD thread is allowed
to use it.
> +
> + if (!tx_bond_cached) {
> + /* Create new bond entry in cache. */
> + tx_bond_cached = xmemdup(tx_bond, sizeof *tx_bond_cached);
> + hmap_insert(&pmd->bond_cache, &tx_bond_cached->node,
> + hash_bond_id(tx_bond_cached->bond_id));
> + } else {
> + /* Update the slave-map. */
> + for (int bucket = 0; bucket <= BOND_MASK; bucket++) {
> + tx_bond_cached->slave_buckets[bucket].slave_id =
> + tx_bond->slave_buckets[bucket].slave_id;
> + }
> + }
> + VLOG_DBG("Caching bond-id %d pmd %d",
> + tx_bond_cached->bond_id, pmd->core_id);
> + }
> +}
> +
> +
> +static void
> pmd_alloc_static_tx_qid(struct dp_netdev_pmd_thread *pmd)
> {
> ovs_mutex_lock(&pmd->dp->tx_qid_pool_mutex);
> @@ -5449,6 +5646,14 @@ pmd_free_static_tx_qid(struct dp_netdev_pmd_thread
> *pmd)
> ovs_mutex_unlock(&pmd->dp->tx_qid_pool_mutex);
> }
>
> +static void
> +pmd_load_bond_cache(struct dp_netdev_pmd_thread *pmd)
> +{
> + ovs_mutex_lock(&pmd->bond_mutex);
> + pmd_load_cached_bonds(pmd);
> + ovs_mutex_unlock(&pmd->bond_mutex);
> +}
Can we avoid introducing of this function by moving mutex acquaring inside
the 'pmd_load_cached_bonds'. Function names are too similar. It's confusing.
> +
> static int
> pmd_load_queues_and_ports(struct dp_netdev_pmd_thread *pmd,
> struct polled_queue **ppoll_list)
> @@ -5476,6 +5681,8 @@ pmd_load_queues_and_ports(struct dp_netdev_pmd_thread
> *pmd,
>
> ovs_mutex_unlock(&pmd->port_mutex);
>
> + pmd_load_bond_cache(pmd);
> +
> *ppoll_list = poll_list;
> return i;
> }
> @@ -6038,6 +6245,7 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread
> *pmd, struct dp_netdev *dp,
> atomic_init(&pmd->reload, false);
> ovs_mutex_init(&pmd->flow_mutex);
> ovs_mutex_init(&pmd->port_mutex);
> + ovs_mutex_init(&pmd->bond_mutex);
> cmap_init(&pmd->flow_table);
> cmap_init(&pmd->classifiers);
> pmd->ctx.last_rxq = NULL;
> @@ -6048,6 +6256,8 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread
> *pmd, struct dp_netdev *dp,
> hmap_init(&pmd->tx_ports);
> hmap_init(&pmd->tnl_port_cache);
> hmap_init(&pmd->send_port_cache);
> + hmap_init(&pmd->tx_bonds);
> + hmap_init(&pmd->bond_cache);
> /* init the 'flow_cache' since there is no
> * actual thread created for NON_PMD_CORE_ID. */
> if (core_id == NON_PMD_CORE_ID) {
> @@ -6068,6 +6278,8 @@ dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
> hmap_destroy(&pmd->send_port_cache);
> hmap_destroy(&pmd->tnl_port_cache);
> hmap_destroy(&pmd->tx_ports);
> + hmap_destroy(&pmd->bond_cache);
> + hmap_destroy(&pmd->tx_bonds);
> hmap_destroy(&pmd->poll_list);
> /* All flows (including their dpcls_rules) have been deleted already */
> CMAP_FOR_EACH (cls, node, &pmd->classifiers) {
> @@ -6079,6 +6291,7 @@ dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
> ovs_mutex_destroy(&pmd->flow_mutex);
> seq_destroy(pmd->reload_seq);
> ovs_mutex_destroy(&pmd->port_mutex);
> + ovs_mutex_destroy(&pmd->bond_mutex);
> free(pmd);
> }
>
> @@ -6232,6 +6445,45 @@ dp_netdev_del_port_tx_from_pmd(struct
> dp_netdev_pmd_thread *pmd,
> free(tx);
> pmd->need_reload = true;
> }
> +
> +static void
> +dp_netdev_add_bond_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
> + struct tx_bond *bond)
> + OVS_REQUIRES(pmd->bond_mutex)
> +{
> + bool reload = false;
> + struct tx_bond *tx = tx_bond_lookup(&pmd->tx_bonds, bond->bond_id);
> + if (tx) {
> + /* Check if mapping is changed. */
> + for (int i = 0; i <= BOND_MASK; i++) {
> + if (bond->slave_buckets[i].slave_id !=
> + tx->slave_buckets[i].slave_id) {
> + /* Mapping is modified. Reload pmd bond cache again. */
> + reload = true;
> + }
> + /* Copy the map always. */
> + tx->slave_buckets[i].slave_id = bond->slave_buckets[i].slave_id;
> + }
> + } else {
> + tx = xmemdup(bond, sizeof *tx);
> + hmap_insert(&pmd->tx_bonds, &tx->node, hash_bond_id(bond->bond_id));
> + reload = true;
> + }
> + if (reload) {
> + pmd->need_reload = true;
> + }
> +}
> +
> +/* Del 'tx' from the tx bond cache of 'pmd' */
> +static void
> +dp_netdev_del_bond_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
> + struct tx_bond *tx)
> + OVS_REQUIRES(pmd->bond_mutex)
> +{
> + hmap_remove(&pmd->tx_bonds, &tx->node);
> + free(tx);
> + pmd->need_reload = true;
> +}
>
> static char *
> dpif_netdev_get_datapath_version(void)
> @@ -7007,6 +7259,13 @@ pmd_send_port_cache_lookup(const struct
> dp_netdev_pmd_thread *pmd,
> return tx_port_lookup(&pmd->send_port_cache, port_no);
> }
>
> +static struct tx_bond *
> +pmd_tx_bond_cache_lookup(const struct dp_netdev_pmd_thread *pmd,
> + uint32_t bond_id)
> +{
> + return tx_bond_lookup(&pmd->bond_cache, bond_id);
> +}
> +
> static int
> push_tnl_action(const struct dp_netdev_pmd_thread *pmd,
> const struct nlattr *attr,
> @@ -7057,6 +7316,91 @@ dp_execute_userspace_action(struct
> dp_netdev_pmd_thread *pmd,
> }
> }
>
> +static bool
> +dp_execute_output_action(struct dp_netdev_pmd_thread *pmd,
> + struct dp_packet_batch *packets_,
> + bool should_steal, odp_port_t port_no)
> +{
> + struct tx_port *p = pmd_send_port_cache_lookup(pmd, port_no);
> + if (!OVS_LIKELY(p)) {
OVS_UNLIKELY(!p)
> + return false;
> + }
> +
> + struct dp_packet_batch out;
> + if (!should_steal) {
> + dp_packet_batch_clone(&out, packets_);
> + dp_packet_batch_reset_cutlen(packets_);
> + packets_ = &out;
> + }
> + dp_packet_batch_apply_cutlen(packets_);
> +#ifdef DPDK_NETDEV
> + if (OVS_UNLIKELY(!dp_packet_batch_is_empty(&p->output_pkts)
> + && packets_->packets[0]->source
> + != p->output_pkts.packets[0]->source)) {
Plaese, align this the same way as it was aligned before.
> + /* netdev-dpdk assumes that all packets in a single
> + * output batch has the same source. Flush here to
> + * avoid memory access issues. */
> + dp_netdev_pmd_flush_output_on_port(pmd, p);
> + }
> +#endif
> + if (dp_packet_batch_size(&p->output_pkts)
> + + dp_packet_batch_size(packets_) > NETDEV_MAX_BURST) {
> + /* Flush here to avoid overflow. */
> + dp_netdev_pmd_flush_output_on_port(pmd, p);
> + }
> + if (dp_packet_batch_is_empty(&p->output_pkts)) {
> + pmd->n_output_batches++;
> + }
> +
> + struct dp_packet *packet;
> + DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
> + p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
> + pmd->ctx.last_rxq;
> + dp_packet_batch_add(&p->output_pkts, packet);
> + }
> + return true;
> +}
> +
> +static bool
> +dp_execute_lb_output_action(struct dp_netdev_pmd_thread *pmd,
> + struct dp_packet_batch *packets_,
> + bool should_steal, uint32_t bond)
> +{
> + struct tx_bond *p_bond = pmd_tx_bond_cache_lookup(pmd, bond);
> + if (!p_bond) {
OVS_UNLIKELY ?
> + return false;
> + }
> +
> + struct dp_packet_batch del_pkts;
> + dp_packet_batch_init(&del_pkts);
> +
> + struct dp_packet *packet;
> + DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
> + /*
> + * Lookup the bond-hash table using hash to get the slave.
> + */
> + uint32_t hash = dp_packet_get_rss_hash(packet);
> + struct slave_entry *s_entry = &p_bond->slave_buckets[hash &
> BOND_MASK];
> + odp_port_t bond_member = s_entry->slave_id;
> + uint32_t size = dp_packet_size(packet);
> +
> + struct dp_packet_batch output_pkt;
> + dp_packet_batch_init_packet(&output_pkt, packet);
> + if (OVS_LIKELY(dp_execute_output_action(pmd, &output_pkt,
> should_steal,
> + bond_member))) {
> + /* Update slave stats. */
> + non_atomic_ullong_add(&s_entry->n_packets, 1);
> + non_atomic_ullong_add(&s_entry->n_bytes, size);
> + } else {
> + dp_packet_batch_add(&del_pkts, packet);
> + }
> + }
> +
> + /* Delete packets that failed OUTPUT action */
> + dp_packet_delete_batch(&del_pkts, should_steal);
We have recently introduced drop stats. No packets should be dropped
now without accounting. Please, introduce new counter for that.
Details: a13a0209750c ("userspace: Improved packet drop statistics.")
> + return true;
> +}
> +
> static void
> dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
> const struct nlattr *a, bool should_steal)
> @@ -7072,43 +7416,15 @@ dp_execute_cb(void *aux_, struct dp_packet_batch
> *packets_,
>
> switch ((enum ovs_action_attr)type) {
> case OVS_ACTION_ATTR_OUTPUT:
> - p = pmd_send_port_cache_lookup(pmd, nl_attr_get_odp_port(a));
> - if (OVS_LIKELY(p)) {
> - struct dp_packet *packet;
> - struct dp_packet_batch out;
> -
> - if (!should_steal) {
> - dp_packet_batch_clone(&out, packets_);
> - dp_packet_batch_reset_cutlen(packets_);
> - packets_ = &out;
> - }
> - dp_packet_batch_apply_cutlen(packets_);
> -
> -#ifdef DPDK_NETDEV
> - if (OVS_UNLIKELY(!dp_packet_batch_is_empty(&p->output_pkts)
> - && packets_->packets[0]->source
> - != p->output_pkts.packets[0]->source)) {
> - /* XXX: netdev-dpdk assumes that all packets in a single
> - * output batch has the same source. Flush here to
> - * avoid memory access issues. */
> - dp_netdev_pmd_flush_output_on_port(pmd, p);
> - }
> -#endif
> - if (dp_packet_batch_size(&p->output_pkts)
> - + dp_packet_batch_size(packets_) > NETDEV_MAX_BURST) {
> - /* Flush here to avoid overflow. */
> - dp_netdev_pmd_flush_output_on_port(pmd, p);
> - }
> -
> - if (dp_packet_batch_is_empty(&p->output_pkts)) {
> - pmd->n_output_batches++;
> - }
> + if (dp_execute_output_action(pmd, packets_, should_steal,
> + nl_attr_get_odp_port(a))) {
> + return;
> + }
> + break;
>
> - DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
> - p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
> -
> pmd->ctx.last_rxq;
> - dp_packet_batch_add(&p->output_pkts, packet);
> - }
> + case OVS_ACTION_ATTR_LB_OUTPUT:
> + if (dp_execute_lb_output_action(pmd, packets_, should_steal,
> + nl_attr_get_u32(a))) {
> return;
> } else {
> COVERAGE_ADD(datapath_drop_invalid_port,
This doesn't look correct. 'else' branch is for OVS_ACTION_ATTR_OUTPUT.
> @@ -7666,6 +7982,109 @@ dpif_netdev_ipf_dump_done(struct dpif *dpif
> OVS_UNUSED, void *ipf_dump_ctx)
>
> }
>
> +static int
> +dpif_netdev_bond_add(struct dpif *dpif, uint32_t bond_id,
> + odp_port_t slave_map[])
> +{
> + struct dp_netdev *dp = get_dp_netdev(dpif);
> +
> + ovs_mutex_lock(&dp->bond_mutex);
> + /*
> + * Lookup for the bond. If already exists, just update the slave-map.
> + * Else create new.
> + */
> + struct tx_bond *dp_bond_entry = tx_bond_lookup(&dp->bonds, bond_id);
> + if (dp_bond_entry) {
> + for (int bucket = 0; bucket <= BOND_MASK; bucket++) {
> + dp_bond_entry->slave_buckets[bucket].slave_id =
> slave_map[bucket];
> + }
> +
> + struct dp_netdev_pmd_thread *pmd;
> + CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> + ovs_mutex_lock(&pmd->bond_mutex);
> + dp_netdev_add_bond_tx_to_pmd(pmd, dp_bond_entry);
> + ovs_mutex_unlock(&pmd->bond_mutex);
> + }
> + } else {
> + struct tx_bond *dp_bond = xzalloc(sizeof *dp_bond);
> + dp_bond->bond_id = bond_id;
> + for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) {
> + dp_bond->slave_buckets[bucket].slave_id = slave_map[bucket];
> + }
> + hmap_insert(&dp->bonds, &dp_bond->node,
> + hash_bond_id(dp_bond->bond_id));
> +
> + /* Insert the bond map in all pmds. */
> + struct dp_netdev_pmd_thread *pmd;
> + CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> + ovs_mutex_lock(&pmd->bond_mutex);
> + dp_netdev_add_bond_tx_to_pmd(pmd, dp_bond);
> + ovs_mutex_unlock(&pmd->bond_mutex);
> + }
> + }
> + ovs_mutex_unlock(&dp->bond_mutex);
> + return 0;
> +}
> +
> +static int
> +dpif_netdev_bond_del(struct dpif *dpif, uint32_t bond_id)
> +{
> + struct dp_netdev *dp = get_dp_netdev(dpif);
> +
> + ovs_mutex_lock(&dp->bond_mutex);
> +
> + /* Find the bond and delete it if present */
> + struct tx_bond *dp_bond_entry = tx_bond_lookup(&dp->bonds, bond_id);
> + if (dp_bond_entry) {
> + /* Remove the bond map in all pmds. */
> + struct dp_netdev_pmd_thread *pmd;
> + CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> + ovs_mutex_lock(&pmd->bond_mutex);
> + dp_netdev_del_bond_tx_from_pmd(pmd, dp_bond_entry);
> + ovs_mutex_unlock(&pmd->bond_mutex);
> + }
> + hmap_remove(&dp->bonds, &dp_bond_entry->node);
> + free(dp_bond_entry);
> + }
> +
> + ovs_mutex_unlock(&dp->bond_mutex);
> + return 0;
> +}
> +
> +static int
> +dpif_netdev_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
> + uint64_t *n_bytes)
> +{
> + struct dp_netdev *dp = get_dp_netdev(dpif);
> +
> + ovs_mutex_lock(&dp->bond_mutex);
> +
> + /* Find the bond and retrieve stats if present */
> + struct tx_bond *dp_bond_entry = tx_bond_lookup(&dp->bonds, bond_id);
> + if (dp_bond_entry) {
> + /* Search the bond in all PMDs */
> + struct dp_netdev_pmd_thread *pmd;
> + CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> + ovs_mutex_lock(&pmd->bond_mutex);
> + struct tx_bond *pmd_bond_entry
> + = tx_bond_lookup(&pmd->bond_cache, bond_id);
You can't access pmd->bond_cache. Only the owning PMD thread allowed to do
that.
> + if (pmd_bond_entry) {
> + /* Read bond stats. */
> + for (int i = 0; i <= BOND_MASK; i++) {
> + uint64_t pmd_n_bytes;
> + atomic_read_relaxed(
> + &pmd_bond_entry->slave_buckets[i].n_bytes,
> + &pmd_n_bytes);
> + n_bytes[i] += pmd_n_bytes;
> + }
> + }
> + ovs_mutex_unlock(&pmd->bond_mutex);
> + }
> + }
> + ovs_mutex_unlock(&dp->bond_mutex);
> + return 0;
> +}
> +
> const struct dpif_class dpif_netdev_class = {
> "netdev",
> true, /* cleanup_required */
> @@ -7739,6 +8158,9 @@ const struct dpif_class dpif_netdev_class = {
> dpif_netdev_meter_set,
> dpif_netdev_meter_get,
> dpif_netdev_meter_del,
> + dpif_netdev_bond_add,
> + dpif_netdev_bond_del,
> + dpif_netdev_bond_stats_get,
> };
>
> static void
> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> index 3d97c78..b1555df 100644
> --- a/lib/dpif-netlink.c
> +++ b/lib/dpif-netlink.c
> @@ -3995,6 +3995,9 @@ const struct dpif_class dpif_netlink_class = {
> dpif_netlink_meter_set,
> dpif_netlink_meter_get,
> dpif_netlink_meter_del,
> + NULL, /* bond_add */
> + NULL, /* bond_del */
> + NULL, /* bond_stats_get */
> };
>
> static int
> diff --git a/lib/dpif-provider.h b/lib/dpif-provider.h
> index b77317b..4415836 100644
> --- a/lib/dpif-provider.h
> +++ b/lib/dpif-provider.h
> @@ -616,6 +616,15 @@ struct dpif_class {
> * zero. */
> int (*meter_del)(struct dpif *, ofproto_meter_id meter_id,
> struct ofputil_meter_stats *, uint16_t n_bands);
> +
> + /* Adds a bond with 'bond_id' and the slave-map to 'dpif'. */
> + int (*bond_add)(struct dpif *dpif, uint32_t bond_id,
> + odp_port_t slave_map[]);
> + /* Removes bond identified by 'bond_id' from 'dpif'. */
> + int (*bond_del)(struct dpif *dpif, uint32_t bond_id);
> + /* Reads bond stats from 'dpif'. */
> + int (*bond_stats_get)(struct dpif *dpif, uint32_t bond_id,
> + uint64_t *n_bytes);
> };
>
> extern const struct dpif_class dpif_netlink_class;
> diff --git a/lib/dpif.c b/lib/dpif.c
> index 955eaf6..41c0886 100644
> --- a/lib/dpif.c
> +++ b/lib/dpif.c
> @@ -1186,6 +1186,7 @@ dpif_execute_helper_cb(void *aux_, struct
> dp_packet_batch *packets_,
>
> case OVS_ACTION_ATTR_CT:
> case OVS_ACTION_ATTR_OUTPUT:
> + case OVS_ACTION_ATTR_LB_OUTPUT:
> case OVS_ACTION_ATTR_TUNNEL_PUSH:
> case OVS_ACTION_ATTR_TUNNEL_POP:
> case OVS_ACTION_ATTR_USERSPACE:
> @@ -1236,6 +1237,7 @@ dpif_execute_helper_cb(void *aux_, struct
> dp_packet_batch *packets_,
> struct dp_packet *clone = NULL;
> uint32_t cutlen = dp_packet_get_cutlen(packet);
> if (cutlen && (type == OVS_ACTION_ATTR_OUTPUT
> + || type == OVS_ACTION_ATTR_LB_OUTPUT
> || type == OVS_ACTION_ATTR_TUNNEL_PUSH
> || type == OVS_ACTION_ATTR_TUNNEL_POP
> || type == OVS_ACTION_ATTR_USERSPACE)) {
> @@ -1895,6 +1897,16 @@ dpif_supports_explicit_drop_action(const struct dpif
> *dpif)
> return dpif_is_netdev(dpif);
> }
>
> +bool
> +dpif_supports_lb_output_action(const struct dpif *dpif)
> +{
> + /*
> + * Balance-tcp optimization is currently supported in netdev
> + * datapath only.
> + */
> + return dpif_is_netdev(dpif);
> +}
> +
> /* Meters */
> void
> dpif_meter_get_features(const struct dpif *dpif,
> @@ -1992,3 +2004,40 @@ dpif_meter_del(struct dpif *dpif, ofproto_meter_id
> meter_id,
> }
> return error;
> }
> +
> +int
> +dpif_bond_add(struct dpif *dpif, uint32_t bond_id, odp_port_t slave_map[])
> +{
> + int error = -1;
> +
> + if (dpif && dpif->dpif_class && dpif->dpif_class->bond_del) {
> + error = dpif->dpif_class->bond_add(dpif, bond_id, slave_map);
> + }
> +
> + return error;
> +}
> +
> +int
> +dpif_bond_del(struct dpif *dpif, uint32_t bond_id)
> +{
> + int error = -1;
> +
> + if (dpif && dpif->dpif_class && dpif->dpif_class->bond_del) {
> + error = dpif->dpif_class->bond_del(dpif, bond_id);
> + }
> +
> + return error;
> +}
> +
> +int
> +dpif_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
> + uint64_t *n_bytes)
> +{
> + int error = -1;
> +
> + if (dpif && dpif->dpif_class && dpif->dpif_class->bond_stats_get) {
> + error = dpif->dpif_class->bond_stats_get(dpif, bond_id, n_bytes);
> + }
> +
> + return error;
> +}
> diff --git a/lib/dpif.h b/lib/dpif.h
> index a88851e..6b3216f 100644
> --- a/lib/dpif.h
> +++ b/lib/dpif.h
> @@ -908,6 +908,13 @@ char *dpif_get_dp_version(const struct dpif *);
> bool dpif_supports_tnl_push_pop(const struct dpif *);
> bool dpif_supports_explicit_drop_action(const struct dpif *);
>
> +bool dpif_supports_lb_output_action(const struct dpif *);
> +
> +int dpif_bond_add(struct dpif *dpif, uint32_t bond_id, odp_port_t
> slave_map[]);
> +int dpif_bond_del(struct dpif *dpif, uint32_t bond_id);
> +int dpif_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
> + uint64_t *n_bytes);
> +
> /* Log functions. */
> struct vlog_module;
>
> diff --git a/lib/odp-execute.c b/lib/odp-execute.c
> index 42d3335..6018e37 100644
> --- a/lib/odp-execute.c
> +++ b/lib/odp-execute.c
> @@ -793,6 +793,7 @@ requires_datapath_assistance(const struct nlattr *a)
> switch (type) {
> /* These only make sense in the context of a datapath. */
> case OVS_ACTION_ATTR_OUTPUT:
> + case OVS_ACTION_ATTR_LB_OUTPUT:
> case OVS_ACTION_ATTR_TUNNEL_PUSH:
> case OVS_ACTION_ATTR_TUNNEL_POP:
> case OVS_ACTION_ATTR_USERSPACE:
> @@ -1068,6 +1069,7 @@ odp_execute_actions(void *dp, struct dp_packet_batch
> *batch, bool steal,
> return;
> }
> case OVS_ACTION_ATTR_OUTPUT:
> + case OVS_ACTION_ATTR_LB_OUTPUT:
> case OVS_ACTION_ATTR_TUNNEL_PUSH:
> case OVS_ACTION_ATTR_TUNNEL_POP:
> case OVS_ACTION_ATTR_USERSPACE:
> diff --git a/lib/odp-util.c b/lib/odp-util.c
> index d6cb9e6..3472a8c 100644
> --- a/lib/odp-util.c
> +++ b/lib/odp-util.c
> @@ -119,6 +119,7 @@ odp_action_len(uint16_t type)
>
> switch ((enum ovs_action_attr) type) {
> case OVS_ACTION_ATTR_OUTPUT: return sizeof(uint32_t);
> + case OVS_ACTION_ATTR_LB_OUTPUT: return sizeof(uint32_t);
> case OVS_ACTION_ATTR_TRUNC: return sizeof(struct ovs_action_trunc);
> case OVS_ACTION_ATTR_TUNNEL_PUSH: return ATTR_LEN_VARIABLE;
> case OVS_ACTION_ATTR_TUNNEL_POP: return sizeof(uint32_t);
> @@ -1122,6 +1123,9 @@ format_odp_action(struct ds *ds, const struct nlattr *a,
> case OVS_ACTION_ATTR_OUTPUT:
> odp_portno_name_format(portno_names, nl_attr_get_odp_port(a), ds);
> break;
> + case OVS_ACTION_ATTR_LB_OUTPUT:
> + ds_put_format(ds, "lb_output(bond,%"PRIu32")", nl_attr_get_u32(a));
> + break;
> case OVS_ACTION_ATTR_TRUNC: {
> const struct ovs_action_trunc *trunc =
> nl_attr_get_unspec(a, sizeof *trunc);
> diff --git a/ofproto/bond.c b/ofproto/bond.c
> index 405202f..d540635 100644
> --- a/ofproto/bond.c
> +++ b/ofproto/bond.c
> @@ -54,10 +54,6 @@ static struct ovs_rwlock rwlock = OVS_RWLOCK_INITIALIZER;
> static struct hmap all_bonds__ = HMAP_INITIALIZER(&all_bonds__);
> static struct hmap *const all_bonds OVS_GUARDED_BY(rwlock) = &all_bonds__;
>
> -/* Bit-mask for hashing a flow down to a bucket. */
> -#define BOND_MASK 0xff
> -#define BOND_BUCKETS (BOND_MASK + 1)
> -
> /* Priority for internal rules created to handle recirculation */
> #define RECIRC_RULE_PRIORITY 20
>
> @@ -126,6 +122,8 @@ struct bond {
> enum lacp_status lacp_status; /* Status of LACP negotiations. */
> bool bond_revalidate; /* True if flows need revalidation. */
> uint32_t basis; /* Basis for flow hash function. */
> + bool use_bond_cache; /* Use bond cache to avoid recirculation.
> + Applicable only for Balance TCP mode. */
>
> /* SLB specific bonding info. */
> struct bond_entry *hash; /* An array of BOND_BUCKETS elements. */
> @@ -185,7 +183,7 @@ static struct bond_slave *choose_output_slave(const
> struct bond *,
> struct flow_wildcards *,
> uint16_t vlan)
> OVS_REQ_RDLOCK(rwlock);
> -static void update_recirc_rules__(struct bond *bond);
> +static void update_recirc_rules__(struct bond *bond, uint32_t
> bond_recirc_id);
> static bool bond_is_falling_back_to_ab(const struct bond *);
>
> /* Attempts to parse 's' as the name of a bond balancing mode. If
> successful,
> @@ -262,6 +260,7 @@ void
> bond_unref(struct bond *bond)
> {
> struct bond_slave *slave;
> + uint32_t bond_recirc_id = 0;
>
> if (!bond || ovs_refcount_unref_relaxed(&bond->ref_cnt) != 1) {
> return;
> @@ -282,12 +281,13 @@ bond_unref(struct bond *bond)
>
> /* Free bond resources. Remove existing post recirc rules. */
> if (bond->recirc_id) {
> + bond_recirc_id = bond->recirc_id;
> recirc_free_id(bond->recirc_id);
> bond->recirc_id = 0;
> }
> free(bond->hash);
> bond->hash = NULL;
> - update_recirc_rules__(bond);
> + update_recirc_rules__(bond, bond_recirc_id);
>
> hmap_destroy(&bond->pr_rule_ops);
> free(bond->name);
> @@ -328,13 +328,14 @@ add_pr_rule(struct bond *bond, const struct match
> *match,
> * lock annotation. Currently, only 'bond_unref()' calls
> * this function directly. */
> static void
> -update_recirc_rules__(struct bond *bond)
> +update_recirc_rules__(struct bond *bond, uint32_t bond_recirc_id)
> {
> struct match match;
> struct bond_pr_rule_op *pr_op, *next_op;
> uint64_t ofpacts_stub[128 / 8];
> struct ofpbuf ofpacts;
> int i;
> + ofp_port_t slave_map[BOND_MASK];
>
> ofpbuf_use_stub(&ofpacts, ofpacts_stub, sizeof ofpacts_stub);
>
> @@ -353,8 +354,18 @@ update_recirc_rules__(struct bond *bond)
>
> add_pr_rule(bond, &match, slave->ofp_port,
> &bond->hash[i].pr_rule);
> + slave_map[i] = slave->ofp_port;
> + } else {
> + slave_map[i] = OFP_PORT_C(-1);
> }
> }
> + if (bond->ofproto->backer->rt_support.lb_output_action) {
> + ofproto_dpif_bundle_add(bond->ofproto, bond->recirc_id,
> slave_map);
> + }
> + } else {
> + if (bond->ofproto->backer->rt_support.lb_output_action) {
> + ofproto_dpif_bundle_del(bond->ofproto, bond_recirc_id);
> + }
> }
>
> HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) {
> @@ -404,7 +415,7 @@ static void
> update_recirc_rules(struct bond *bond)
> OVS_REQ_RDLOCK(rwlock)
> {
> - update_recirc_rules__(bond);
> + update_recirc_rules__(bond, bond->recirc_id);
> }
>
> /* Updates 'bond''s overall configuration to 's'.
> @@ -467,6 +478,10 @@ bond_reconfigure(struct bond *bond, const struct
> bond_settings *s)
> recirc_free_id(bond->recirc_id);
> bond->recirc_id = 0;
> }
> + if (bond->use_bond_cache != s->use_bond_cache) {
> + bond->use_bond_cache = s->use_bond_cache;
> + revalidate = true;
> + }
>
> if (bond->balance == BM_AB || !bond->hash || revalidate) {
> bond_entry_reset(bond);
> @@ -944,6 +959,15 @@ bond_recirculation_account(struct bond *bond)
> OVS_REQ_WRLOCK(rwlock)
> {
> int i;
> + uint64_t n_bytes[BOND_BUCKETS] = {0};
> +
> + if (bond->hash && bond->recirc_id) {
> + /* Retrieve bond stats from datapath. */
> + if (bond->ofproto->backer->rt_support.lb_output_action) {
> + dpif_bond_stats_get(bond->ofproto->backer->dpif,
> + bond->recirc_id, n_bytes);
> + }
> + }
>
> for (i=0; i<=BOND_MASK; i++) {
> struct bond_entry *entry = &bond->hash[i];
> @@ -953,8 +977,12 @@ bond_recirculation_account(struct bond *bond)
> struct pkt_stats stats;
> long long int used OVS_UNUSED;
>
> - rule->ofproto->ofproto_class->rule_get_stats(
> - rule, &stats, &used);
> + if (!bond->ofproto->backer->rt_support.lb_output_action) {
> + rule->ofproto->ofproto_class->rule_get_stats(
> + rule, &stats, &used);
> + } else {
> + stats.n_bytes = n_bytes[i];
> + }
> bond_entry_account(entry, stats.n_bytes);
> }
> }
> @@ -1365,6 +1393,8 @@ bond_print_details(struct ds *ds, const struct bond
> *bond)
> may_recirc ? "yes" : "no", may_recirc ? recirc_id: -1);
>
> ds_put_format(ds, "bond-hash-basis: %"PRIu32"\n", bond->basis);
> + ds_put_format(ds, "lb-output-action: %s\n",
> + bond->use_bond_cache ? "enabled" : "disabled");
>
> ds_put_format(ds, "updelay: %d ms\n", bond->updelay);
> ds_put_format(ds, "downdelay: %d ms\n", bond->downdelay);
> @@ -1942,3 +1972,9 @@ bond_get_changed_active_slave(const char *name, struct
> eth_addr *mac,
>
> return false;
> }
> +
> +bool
> +bond_get_cache_mode(const struct bond *bond)
> +{
> + return bond->use_bond_cache;
> +}
> diff --git a/ofproto/bond.h b/ofproto/bond.h
> index e7c3d9b..88a4de1 100644
> --- a/ofproto/bond.h
> +++ b/ofproto/bond.h
> @@ -22,6 +22,10 @@
> #include "ofproto-provider.h"
> #include "packets.h"
>
> +/* Bit-mask for hashing a flow down to a bucket. */
> +#define BOND_MASK 0xff
> +#define BOND_BUCKETS (BOND_MASK + 1)
> +
> struct flow;
> struct netdev;
> struct ofpbuf;
> @@ -58,6 +62,8 @@ struct bond_settings {
> /* The MAC address of the interface
> that was active during the last
> ovs run. */
> + bool use_bond_cache; /* Use bond cache. Only applicable for
> + bond mode BALANCE TCP. */
> };
>
> /* Program startup. */
> @@ -122,4 +128,7 @@ void bond_rebalance(struct bond *);
> */
> void bond_update_post_recirc_rules(struct bond *, uint32_t *recirc_id,
> uint32_t *hash_basis);
> +
> +bool bond_get_cache_mode(const struct bond *);
> +
> #endif /* bond.h */
> diff --git a/ofproto/ofproto-dpif-ipfix.c b/ofproto/ofproto-dpif-ipfix.c
> index b413768..5030978 100644
> --- a/ofproto/ofproto-dpif-ipfix.c
> +++ b/ofproto/ofproto-dpif-ipfix.c
> @@ -3017,6 +3017,7 @@ dpif_ipfix_read_actions(const struct flow *flow,
> case OVS_ACTION_ATTR_CHECK_PKT_LEN:
> case OVS_ACTION_ATTR_UNSPEC:
> case OVS_ACTION_ATTR_DROP:
> + case OVS_ACTION_ATTR_LB_OUTPUT:
> case __OVS_ACTION_ATTR_MAX:
> default:
> break;
> diff --git a/ofproto/ofproto-dpif-sflow.c b/ofproto/ofproto-dpif-sflow.c
> index f9ea47a..21dafa6 100644
> --- a/ofproto/ofproto-dpif-sflow.c
> +++ b/ofproto/ofproto-dpif-sflow.c
> @@ -1177,6 +1177,7 @@ dpif_sflow_read_actions(const struct flow *flow,
> case OVS_ACTION_ATTR_CT:
> case OVS_ACTION_ATTR_CT_CLEAR:
> case OVS_ACTION_ATTR_METER:
> + case OVS_ACTION_ATTR_LB_OUTPUT:
> break;
>
> case OVS_ACTION_ATTR_SET_MASKED:
> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> index 4407f9c..0446387 100644
> --- a/ofproto/ofproto-dpif-xlate.c
> +++ b/ofproto/ofproto-dpif-xlate.c
> @@ -409,6 +409,8 @@ struct xlate_ctx {
> struct ofpbuf action_set; /* Action set. */
>
> enum xlate_error error; /* Translation failed. */
> +
> + bool tnl_push_no_recirc; /* Tunnel push recirculation status */
> };
>
> /* Structure to track VLAN manipulation */
> @@ -3727,12 +3729,16 @@ native_tunnel_output(struct xlate_ctx *ctx, const
> struct xport *xport,
> ctx->xin->allow_side_effects = backup_side_effects;
> ctx->xin->packet = backup_packet;
> ctx->wc = backup_wc;
> +
> + ctx->tnl_push_no_recirc = true;
> } else {
> /* In order to maintain accurate stats, use recirc for
> * natvie tunneling. */
> nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC, 0);
> nl_msg_end_nested(ctx->odp_actions, clone_ofs);
> - }
> +
> + ctx->tnl_push_no_recirc = false;
> + }
>
> /* Restore the flows after the translation. */
> memcpy(&ctx->xin->flow, &old_flow, sizeof ctx->xin->flow);
> @@ -4183,7 +4189,6 @@ compose_output_action__(struct xlate_ctx *ctx,
> ofp_port_t ofp_port,
> xlate_commit_actions(ctx);
>
> if (xr) {
> - /* Recirculate the packet. */
> struct ovs_action_hash *act_hash;
>
> /* Hash action. */
> @@ -4192,15 +4197,30 @@ compose_output_action__(struct xlate_ctx *ctx,
> ofp_port_t ofp_port,
> /* Algorithm supported by all datapaths. */
> hash_alg = OVS_HASH_ALG_L4;
> }
> - act_hash = nl_msg_put_unspec_uninit(ctx->odp_actions,
> +
> + if ((xr->hash_alg == OVS_HASH_ALG_L4) &&
> + (ctx->xbridge->support.lb_output_action) &&
> + !ctx->tnl_push_no_recirc &&
> + bond_get_cache_mode(xport->xbundle->bond)) {
> + /*
> + * If bond mode is balance-tcp and optimize balance tcp
> + * is enabled then use the hash directly for slave selection
> + * and avoid recirculation.
> + *
> + * Currently support for netdev datapath only.
> + */
> + nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_LB_OUTPUT,
> + xr->recirc_id);
> + } else {
> + act_hash = nl_msg_put_unspec_uninit(ctx->odp_actions,
> OVS_ACTION_ATTR_HASH,
> sizeof *act_hash);
> - act_hash->hash_alg = hash_alg;
> - act_hash->hash_basis = xr->hash_basis;
> -
> - /* Recirc action. */
> - nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC,
> - xr->recirc_id);
> + act_hash->hash_alg = hash_alg;
> + act_hash->hash_basis = xr->hash_basis;
> + /* Recirculate the packet. */
> + nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC,
> + xr->recirc_id);
> + }
> } else if (is_native_tunnel) {
> /* Output to native tunnel port. */
> native_tunnel_output(ctx, xport, flow, odp_port, truncate);
> @@ -7261,7 +7281,8 @@ count_output_actions(const struct ofpbuf *odp_actions)
> int n = 0;
>
> NL_ATTR_FOR_EACH_UNSAFE (a, left, odp_actions->data, odp_actions->size) {
> - if (a->nla_type == OVS_ACTION_ATTR_OUTPUT) {
> + if ((a->nla_type == OVS_ACTION_ATTR_OUTPUT) ||
> + (a->nla_type == OVS_ACTION_ATTR_LB_OUTPUT)) {
> n++;
> }
> }
> @@ -7458,6 +7479,7 @@ xlate_actions(struct xlate_in *xin, struct xlate_out
> *xout)
>
> .action_set_has_group = false,
> .action_set = OFPBUF_STUB_INITIALIZER(action_set_stub),
> + .tnl_push_no_recirc = false,
> };
>
> /* 'base_flow' reflects the packet as it came in, but we need it to
> reflect
> diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
> index f782f86..3c5438b 100644
> --- a/ofproto/ofproto-dpif.c
> +++ b/ofproto/ofproto-dpif.c
> @@ -1580,6 +1580,8 @@ check_support(struct dpif_backer *backer)
> backer->rt_support.ct_timeout = check_ct_timeout_policy(backer);
> backer->rt_support.explicit_drop_action =
> dpif_supports_explicit_drop_action(backer->dpif);
> + backer->rt_support.lb_output_action=
> + dpif_supports_lb_output_action(backer->dpif);
>
> /* Flow fields. */
> backer->rt_support.odp.ct_state = check_ct_state(backer);
> @@ -3433,6 +3435,27 @@ bundle_remove(struct ofport *port_)
> }
> }
>
> +int
> +ofproto_dpif_bundle_add(struct ofproto_dpif *ofproto,
> + uint32_t bond_id, const ofp_port_t slave_map[])
> +{
> + /* Convert ofp_port to odp_port */
> + odp_port_t odp_map[BOND_BUCKETS];
> + for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) {
> + odp_map[bucket] = (slave_map[bucket] == OFP_PORT_C(-1)
> + ? ODP_PORT_C(-1)
> + : ofp_port_to_odp_port(ofproto,
> slave_map[bucket]));
> + }
> +
> + return dpif_bond_add(ofproto->backer->dpif, bond_id, odp_map);
> +}
> +
> +int
> +ofproto_dpif_bundle_del(struct ofproto_dpif *ofproto, uint32_t bond_id)
> +{
> + return dpif_bond_del(ofproto->backer->dpif, bond_id);
> +}
> +
> static void
> send_pdu_cb(void *port_, const void *pdu, size_t pdu_size)
> {
> @@ -5563,6 +5586,7 @@ get_datapath_cap(const char *datapath_type, struct smap
> *cap)
> smap_add(cap, "ct_timeout", s.ct_timeout ? "true" : "false");
> smap_add(cap, "explicit_drop_action",
> s.explicit_drop_action ? "true" :"false");
> + smap_add(cap, "lb_output_action", s.lb_output_action ? "true" : "false");
> }
>
> /* Gets timeout policy name in 'backer' based on 'zone', 'dl_type' and
> diff --git a/ofproto/ofproto-dpif.h b/ofproto/ofproto-dpif.h
> index c9d5df3..384a269 100644
> --- a/ofproto/ofproto-dpif.h
> +++ b/ofproto/ofproto-dpif.h
> @@ -202,8 +202,10 @@ struct group_dpif *group_dpif_lookup(struct ofproto_dpif
> *,
> DPIF_SUPPORT_FIELD(bool, ct_timeout, "Conntrack timeout policy") \
> \
> /* True if the datapath supports explicit drop action. */ \
> - DPIF_SUPPORT_FIELD(bool, explicit_drop_action, "Explicit Drop action")
> -
> + DPIF_SUPPORT_FIELD(bool, explicit_drop_action, "Explicit Drop action") \
> + \
> + /* True if the datapath supports balance_tcp optimization */ \
> + DPIF_SUPPORT_FIELD(bool, lb_output_action, "Optimized Balance TCP mode")
>
> /* Stores the various features which the corresponding backer supports. */
> struct dpif_backer_support {
> @@ -379,6 +381,9 @@ int ofproto_dpif_add_internal_flow(struct ofproto_dpif *,
> struct rule **rulep);
> int ofproto_dpif_delete_internal_flow(struct ofproto_dpif *, struct match *,
> int priority);
> +int ofproto_dpif_bundle_add(struct ofproto_dpif *, uint32_t bond_id,
> + const ofp_port_t slave_map[]);
> +int ofproto_dpif_bundle_del(struct ofproto_dpif *, uint32_t bond_id);
>
> bool ovs_native_tunneling_is_on(struct ofproto_dpif *);
>
> diff --git a/tests/lacp.at b/tests/lacp.at
> index 7b460d7..c881a0b 100644
> --- a/tests/lacp.at
> +++ b/tests/lacp.at
> @@ -121,6 +121,7 @@ AT_CHECK([ovs-appctl bond/show], [0], [dnl
> bond_mode: active-backup
> bond may use recirculation: no, Recirc-ID : -1
> bond-hash-basis: 0
> +lb-output-action: disabled
> updelay: 0 ms
> downdelay: 0 ms
> lacp_status: negotiated
> @@ -286,6 +287,7 @@ slave: p3: current attached
> bond_mode: balance-tcp
> bond may use recirculation: yes, <del>
> bond-hash-basis: 0
> +lb-output-action: disabled
> updelay: 0 ms
> downdelay: 0 ms
> lacp_status: negotiated
> @@ -301,6 +303,7 @@ slave p1: enabled
> bond_mode: balance-tcp
> bond may use recirculation: yes, <del>
> bond-hash-basis: 0
> +lb-output-action: disabled
> updelay: 0 ms
> downdelay: 0 ms
> lacp_status: negotiated
> @@ -423,6 +426,7 @@ slave: p3: current attached
> bond_mode: balance-tcp
> bond may use recirculation: yes, <del>
> bond-hash-basis: 0
> +lb-output-action: disabled
> updelay: 0 ms
> downdelay: 0 ms
> lacp_status: negotiated
> @@ -440,6 +444,7 @@ slave p1: enabled
> bond_mode: balance-tcp
> bond may use recirculation: yes, <del>
> bond-hash-basis: 0
> +lb-output-action: disabled
> updelay: 0 ms
> downdelay: 0 ms
> lacp_status: negotiated
> @@ -555,6 +560,7 @@ slave: p3: current attached
> bond_mode: balance-tcp
> bond may use recirculation: yes, <del>
> bond-hash-basis: 0
> +lb-output-action: disabled
> updelay: 0 ms
> downdelay: 0 ms
> lacp_status: negotiated
> @@ -572,6 +578,7 @@ slave p1: enabled
> bond_mode: balance-tcp
> bond may use recirculation: yes, <del>
> bond-hash-basis: 0
> +lb-output-action: disabled
> updelay: 0 ms
> downdelay: 0 ms
> lacp_status: negotiated
> @@ -692,6 +699,7 @@ slave: p3: current attached
> bond_mode: balance-tcp
> bond may use recirculation: yes, <del>
> bond-hash-basis: 0
> +lb-output-action: disabled
> updelay: 0 ms
> downdelay: 0 ms
> lacp_status: negotiated
> @@ -709,6 +717,7 @@ slave p1: enabled
> bond_mode: balance-tcp
> bond may use recirculation: yes, <del>
> bond-hash-basis: 0
> +lb-output-action: disabled
> updelay: 0 ms
> downdelay: 0 ms
> lacp_status: negotiated
> diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
> index 86c7b10..d4f34ad 100644
> --- a/vswitchd/bridge.c
> +++ b/vswitchd/bridge.c
> @@ -4589,6 +4589,12 @@ port_configure_bond(struct port *port, struct
> bond_settings *s)
> /* OVSDB did not store the last active interface */
> s->active_slave_mac = eth_addr_zero;
> }
> + /* Default set bond cache use to false. */
> + s->use_bond_cache = false;
> + if (s->balance == BM_TCP) {
> + s->use_bond_cache = smap_get_bool(&port->cfg->other_config,
> + "lb-output-action", false);
> + }
> }
>
> /* Returns true if 'port' is synthetic, that is, if we constructed it locally
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 0ec726c..bdc7ef4 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -1962,6 +1962,16 @@
> <code>active-backup</code>.
> </column>
>
> + <column name="other_config" key="lb-output-action"
> + type='{"type": "boolean"}'>
> + Enable/disable usage of optimized lb-output-action for
> + balancing flows among output slaves in load balanced bonds in
> + <code>balance-tcp</code>. When enabled, it uses optimized path for
> + balance-tcp mode by using rss hash and avoids recirculation.
> + It affects only new flows, i.e, existing flows remain unchanged.
> + This knob does not affect other balancing modes.
> + </column>
> +
> <group title="Link Failure Detection">
> <p>
> An important part of link bonding is detecting that links are down
> so
> @@ -5722,6 +5732,19 @@ ovs-vsctl add-port br0 p0 -- set Interface p0
> type=patch options:peer=p1 \
> higher performance for MPLS and active-active load balancing
> bonding modes.
> </column>
> + <column name="capabilities" key="lb_output_action"
> + type='{"type": "boolean"}'>
> + If this is true, then the datapath supports optimized balance-tcp
> + bond mode. This capability replaces existing <code>hash</code> and
> + <code>recirc</code> actions with new action <code>lb_output</code>
> + and avoids recirculation of packet in datapath. It is supported
> + only for balance-tcp bond mode in netdev datapath. The new action
> + gives higer performance but also requires pmd reloads whenever
> + traffic load needs to be rebalanced among bonded links. By default
> + this new action is disabled, however it can be enabled by setting
> + <ref column="other-config" key="lb-output-action"/> in
> + <ref table="Port"/> table.
> + </column>
> <group title="Connection-Tracking Capabilities">
> <p>
> These capabilities are granular because Open vSwitch and its
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev