Thanks for doing this! Some suggestions below.

This no longer applies cleanly, and I also needed to add this incremental to 
make the testsuite pass:

diff --git a/tests/pmd.at b/tests/pmd.at
index 3d219ff..9b6427e 100644
--- a/tests/pmd.at
+++ b/tests/pmd.at
@@ -144,10 +144,11 @@ dummy@ovs-dummy: hit:0 missed:0
                p0 7/1: (dummy-pmd: configured_rx_queues=4, configured_tx_queues=<cleared>, requested_rx_queues=4, requested_tx_queues=<cleared>)
])

-AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 4], [0], [dnl
+AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 5], [0], [dnl
pmd thread numa_id <cleared> core_id <cleared>:
        emc hits:0
        megaflow hits:0
+       avg. subtable lookups per hit: 0.00
        miss:0
        lost:0
])
@@ -170,10 +171,11 @@ AT_CHECK([cat ovs-vswitchd.log | filter_flow_install | strip_xout], [0], [dnl
recirc_id(0),in_port(1),eth(src=50:54:00:00:00:77,dst=50:54:00:00:01:78),eth_type(0x0800),ipv4(frag=no), actions: <del>
])

-AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 4], [0], [dnl
+AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 5], [0], [dnl
pmd thread numa_id <cleared> core_id <cleared>:
        emc hits:19
        megaflow hits:0
+       avg. subtable lookups per hit: 0.00
        miss:1
        lost:0
])

> On Jun 16, 2016, at 6:56 AM, Jan Scheurich <jan.scheur...@ericsson.com> wrote:
> 
> The user-space datapath (dpif-netdev) consists of a first level "exact match 
> cache" (EMC) matching on 5-tuples and the normal megaflow classifier. With 
> many parallel packet flows (e.g. TCP connections) the EMC becomes inefficient 
> and the OVS forwarding performance is determined by the megaflow classifier.
> 
> The megaflow classifier (dpcls) consists of a variable number of hash tables 
> (aka subtables), each containing megaflow entries with the same mask of 
> packet header and metadata fields to match upon. A dpcls lookup matches a 
> given packet against all subtables in sequence until it hits a match. As 
> megaflow cache entries are by construction non-overlapping, the first match 
> is the only match.
> 
> Today the order of the subtables in the dpcls is essentially random, so on 
> average a dpcls lookup has to visit N/2 subtables for a hit, where N is the 
> total number of subtables. Even though every single hash-table lookup is 
> fast, the performance of the current dpcls degrades when there are many 
> subtables.
> 
> How does the patch address this issue:
> 
> In reality there is often a strong correlation between the ingress port and a 
> small subset of subtables that have hits. The entire megaflow cache typically 
> decomposes nicely into partitions that are hit only by packets entering from 
> a range of similar ports (e.g. traffic from Phy -> VM vs. traffic from VM -> 
> Phy).
> 
> Therefore, keeping a separate list of subtables per ingress port, sorted by 
> frequency of hits, reduces the average number of subtables lookups in the 
> dpcls to a minimum, even if the total number of subtables gets large.

Did you consider keeping a separate dpcls instance per input port instead? 
I.e., have an hmap of dpcls'es using the in_port as the hash key, and creating 
and destroying them on demand. This would avoid the fixed-size array, simplify 
the code, and make each insert and remove operation a bit cheaper. A miss 
would typically go through a lower number of subtables, even though the upcall 
cost would likely hide this benefit in practice. We always exact-match on 
in_port (as well as dl_type and recirc_id; see xlate_wc_init()), so this would 
add no overhead in terms of the overall number of megaflows.

> 
> The patch introduces 32 subtable vectors per dpcls and hashes the ingress 
> port to select the subtable vector. The patch also counts matches per 32 
> slots in each vector (hashing the subtable pointer to obtain the slot) and 
> sorts the vectors according to match frequency every second.
> 

Also, it should be possible to expand the pvector API to allow using the 
pvector priorities directly as the counts. I posted a patch for this 
(https://patchwork.ozlabs.org/patch/638920/), after which you could apply the 
following incremental. I haven't tested how this performs, though.

 Jarno

From: Jarno Rajahalme <ja...@ovn.org>
Date: Tue, 21 Jun 2016 17:17:55 -0700
Subject: [PATCH] dpif-netdev: Use non-concurrent pvector.
Cc: <ja...@ovn.org>

PMD threads do not need the concurrent version of the pvector.  The
non-concurrent pvector entry priorities can be incremented in place
and periodically sorted.

Signed-off-by: Jarno Rajahalme <ja...@ovn.org>
---
lib/dpif-netdev.c | 68 +++++++++++++++++++++++++++----------------------------
1 file changed, 34 insertions(+), 34 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index c491110..647bfb7 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -151,12 +151,11 @@ struct emc_cache {

#define NUM_SUBTABLE_VECTORS 32
#define NUM_SUBTABLE_SLOTS 32
-#define DPCLS_OPTIMIATION_INTERVAL 1000
+#define DPCLS_OPTIMIZATION_INTERVAL 1000

struct dpcls {
    struct cmap subtables_map;
-    struct pvector subtables[NUM_SUBTABLE_VECTORS];
-    uint64_t hit_cnt[NUM_SUBTABLE_VECTORS][NUM_SUBTABLE_SLOTS];
+    struct pvector_impl *subtables[NUM_SUBTABLE_VECTORS];
    long long int next_optimization;
};

@@ -4371,11 +4370,10 @@ dpcls_init(struct dpcls *cls)
    int i;

    cmap_init(&cls->subtables_map);
-    for (i=0; i<NUM_SUBTABLE_VECTORS; i++) {
-        pvector_init(&cls->subtables[i]);
+    for (i = 0; i < NUM_SUBTABLE_VECTORS; i++) {
+        cls->subtables[i] = pvector_impl_alloc(PVECTOR_EXTRA_ALLOC);
    }
-    memset(cls->hit_cnt, 0, sizeof cls->hit_cnt);
-    cls->next_optimization = time_msec() + DPCLS_OPTIMIATION_INTERVAL;
+    cls->next_optimization = time_msec() + DPCLS_OPTIMIZATION_INTERVAL;
}

static void
@@ -4383,8 +4381,8 @@ dpcls_destroy_subtable(struct dpcls *cls, struct dpcls_subtable *subtable)
{
    int i;

-    for (i=0; i<NUM_SUBTABLE_VECTORS; i++) {
-        pvector_remove(&cls->subtables[i], subtable);
+    for (i = 0; i < NUM_SUBTABLE_VECTORS; i++) {
+        pvector_impl_remove(cls->subtables[i], subtable);
    }
    cmap_remove(&cls->subtables_map, &subtable->cmap_node,
                subtable->mask.hash);
@@ -4407,8 +4405,8 @@ dpcls_destroy(struct dpcls *cls)
            dpcls_destroy_subtable(cls, subtable);
        }
        cmap_destroy(&cls->subtables_map);
-        for (i=0; i<NUM_SUBTABLE_VECTORS; i++) {
-            pvector_destroy(&cls->subtables[i]);
+        for (i = 0; i < NUM_SUBTABLE_VECTORS; i++) {
+            free(cls->subtables[i]);
        }
    }
}
@@ -4417,7 +4415,6 @@ static struct dpcls_subtable *
dpcls_create_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
{
    struct dpcls_subtable *subtable;
-    int i;

    /* Need to add one. */
    subtable = xmalloc(sizeof *subtable
@@ -4425,10 +4422,10 @@ dpcls_create_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
    cmap_init(&subtable->rules);
    netdev_flow_key_clone(&subtable->mask, mask);
    cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask->hash);
-    for (i=0; i<NUM_SUBTABLE_VECTORS; i++) {
+
+    for (int i = 0; i < NUM_SUBTABLE_VECTORS; i++) {
        /* Insert new subtable with priority zero (no hits) */
-        pvector_insert(&cls->subtables[i], subtable, 0);
-        pvector_publish(&cls->subtables[i]);
+        pvector_impl_push_back(&cls->subtables[i], subtable, 0);
    }

    return subtable;
@@ -4456,22 +4453,22 @@ dpcls_try_optimize(struct dpcls *cls)
    long long int now = time_msec();

    if (now > cls->next_optimization) {
-        struct dpcls_subtable *subtable;
-        int port_idx, subtable_idx;
-
-        for (port_idx=0; port_idx<NUM_SUBTABLE_VECTORS; port_idx++) {
-            struct pvector *pvec = &cls->subtables[port_idx];
-            PVECTOR_FOR_EACH (subtable, pvec) {
-                subtable_idx = hash_pointer(subtable, 210365) % NUM_SUBTABLE_SLOTS;
-                pvector_change_priority(pvec, subtable,
-                        cls->hit_cnt[port_idx][subtable_idx]);
+        int port_idx;
+
+        for (port_idx = 0; port_idx < NUM_SUBTABLE_VECTORS; port_idx++) {
+            struct pvector_impl *pvec = cls->subtables[port_idx];
+            struct pvector_entry *entry;
+            int index;
+
+            pvector_impl_sort(pvec);
+
+            PVECTOR_IMPL_FOR_EACH_ENTRY (entry, index, pvec) {
+                entry->priority = 0;
            }
-            pvector_publish(pvec);
        }

        /* Start new measuring interval */
-        memset(cls->hit_cnt, 0, sizeof cls->hit_cnt);
-        cls->next_optimization = now + DPCLS_OPTIMIATION_INTERVAL;
+        cls->next_optimization = now + DPCLS_OPTIMIZATION_INTERVAL;
    }
}

@@ -4500,8 +4497,9 @@ dpcls_remove(struct dpcls *cls, struct dpcls_rule *rule)
    if (cmap_remove(&subtable->rules, &rule->cmap_node, rule->flow.hash)
        == 0) {
        dpcls_destroy_subtable(cls, subtable);
-        for (i=0; i<NUM_SUBTABLE_VECTORS; i++) {
-            pvector_publish(&cls->subtables[i]);
+
+        for (i = 0; i < NUM_SUBTABLE_VECTORS; i++) {
+            pvector_impl_sort(cls->subtables[i]); /* Removes gaps. */
        }
    }
}
@@ -4559,10 +4557,11 @@ dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key keys[],

    /* Select the subtable vector for the in_port */
    uint32_t port_idx = hash_port_no(port_no) % NUM_SUBTABLE_VECTORS;
-    uint64_t *hit_cnt = cls->hit_cnt[port_idx];
    int lookups_match = 0, subtable_pos = 1;
+    struct pvector_entry *entry;
+    int index;

-    PVECTOR_FOR_EACH (subtable, &cls->subtables[port_idx]) {
+    PVECTOR_IMPL_FOR_EACH_ENTRY (entry, index, cls->subtables[port_idx]) {
        const struct netdev_flow_key *mkeys = keys;
        struct dpcls_rule **mrules = rules;
        map_type remains = 0;
@@ -4570,6 +4569,8 @@ dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key keys[],

        BUILD_ASSERT_DECL(sizeof remains == sizeof *maps);

+        subtable = entry->ptr;
+
        for (m = 0; m < N_MAPS; m++, mkeys += MAP_BITS, mrules += MAP_BITS) {
            uint32_t hashes[MAP_BITS];
            const struct cmap_node *nodes[MAP_BITS];
@@ -4594,9 +4595,8 @@ dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key keys[],
                CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {
                    if (OVS_LIKELY(dpcls_rule_matches_key(rule, &mkeys[i]))) {
                        mrules[i] = rule;
-                        /* Update the subtable hit statistics for the in_port vector */
-                        int subtable_idx = hash_pointer(subtable, 210365) % NUM_SUBTABLE_SLOTS;
-                        hit_cnt[subtable_idx]++;
+                        /* Update the subtable hit stats.  */
+                        entry->priority++;
                        lookups_match += subtable_pos;
                        goto next;
                    }
-- 
2.1.4


_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev
