On 03.07.2017 20:33, Darrell Ball wrote:
> 
> 
> On 7/3/17, 7:31 AM, "ovs-dev-boun...@openvswitch.org on behalf of Jan 
> Scheurich" <ovs-dev-boun...@openvswitch.org on behalf of 
> jan.scheur...@ericsson.com> wrote:
> 
>     I like this generic approach to collect the packets to be output per port 
> for each Rx batch in dpif-netdev. It is indeed simpler than the approach in 
> [1].
>     
>     However, [1] originally had a larger scope, namely to buffer packets in 
> an intermediate queue per netdev tx queue *across* multiple rx batches. 
> Unfortunately this aspect was "optimized" away in v3 to minimize the impact 
> on minimum and average latency for ports with little traffic.
>     
>     Limiting the output batching to a single rx batch has little benefit 
> unless the average rx batch size is significant. According to our 
> measurements with instrumented netdev-dpdk datapath this happens only very 
> close to or above PMD saturation. At realistic production load levels (say 
> 95% PMD processing cycles) the average rx batch size is still around 1-2 
> only, so the gain we can expect is negligible. The primary use case for this 
> patch appears to be to push up the saturation throughput in benchmarks.
> 
> [Darrell]
> This patch series suggested by Ilya is mainly a subset of the series 
> suggested by Bhanu here
> [1] https://patchwork.ozlabs.org/patch/706461/
> and hence the two are not really comparable.
> Ilya’s series collects packets per rx batch and sends each subset as a 
> tx batch per output port. As Jan points out here, and as has come up many 
> times before, real rx batch sizes are themselves small. Also, in general, 
> an implementation should not rely on a very lucky distribution of input 
> packets within one rx batch to provide some benefit. I also think the 
> benefit will be negligible in the general case. It would be good to see 
> what the extra cost of the additional checking is when the general case 
> of negligible benefit is in effect.
> 
> I agree one valid concept in Ilya’s patch set is putting the code in a 
> common code path across netdevs. This may be an oversimplification in 
> some cases, however. If tx batching belongs at the netdev layer in some 
> cases, I don’t think we should put some of it in dpif-netdev and some in 
> netdev. Mixing layers for one function (i.e. tx batching) will be more 
> confusing and less maintainable.
> 
> 
>     
>     In our perspective a perhaps more important use case for tx batching 
> would be to reduce the number of virtio interrupts a PMD needs to trigger in 
> Qemu/KVM guests using the kernel virtio-net driver (currently one per tx 
> batch).
>     
>     In internal benchmarks with kernel applications we have observed that the 
> OVS PMD spends up to 30% of its cycles kicking the eventfd and the guest OS 
> in the VM spends 50% or more of a vCPU processing virtio-net interrupts. This 
> seriously degrades the performance of both OVS and application.
>     
>     With a vhostuser tx batching patch that is not limited to a single Rx 
> batch but relies on periodic flushing (e.g. every 50-100 us), we have been 
> able to reduce this interrupt overhead and significantly improve e.g. the 
> iperf performance.
>     
>     My suggestion would be to complement the generic output packet batching 
> per rx batch in dpif-netdev in this patch with specific support for tx 
> batching for vhostuser ports in netdev-dpdk, using an intermediate queue with 
> a configurable flushing interval (including the possibility to turn buffering 
> off when latency is an issue).
> 
> [Darrell]
> If some tx batching logically belongs at the netdev layer, then we should put 
> all of it there, since in this 
> situation the behavior really is “class or subclass” specific and we should 
> recognize that.
> I don’t think we should mix and match tx batching across layers.
> 
> The general idea of decoupling rx and tx, which is the ultimate direction 
> of Bhanu’s patch, has merit in some situations, especially where added 
> latency is less important than throughput. This would ultimately allow 
> collecting packets of a tx batch across different rx batches and over a 
> longer time than a single rx batch.
> However, as Jan points out, there is going to be a tradeoff between 
> latency and throughput; I don’t think this is a surprise to anyone. That 
> logically leads us to more configuration knobs, which I think is ok here.
> 
> 
>     
>     BR, Jan
>     
>     > -----Original Message-----
>     > From: ovs-dev-boun...@openvswitch.org 
> [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of Ilya Maximets
>     > Sent: Friday, 30 June, 2017 14:03
>     > To: ovs-dev@openvswitch.org; Bhanuprakash Bodireddy 
> <bhanuprakash.bodire...@intel.com>
>     > Cc: Heetae Ahn <heetae82....@samsung.com>; Ilya Maximets 
> <i.maxim...@samsung.com>
>     > Subject: [ovs-dev] [PATCH 0/3] Output packet batching.
>     > 
>     > This patch-set was inspired by [1] from Bhanuprakash Bodireddy.
>     > The implementation of [1] looks very complex and introduces many
>     > pitfalls for later code modifications, such as packets getting stuck.
>     > 
>     > This version aims to provide simple and flexible output packet
>     > batching at a higher level, without complicating (and even
>     > simplifying) the netdev layer.
>     > 
>     > The patch set consists of 3 patches. All the functionality is
>     > introduced in the first patch. The other two are just cleanups of
>     > netdevs so they stop doing unnecessary things.
>     > 
>     > Basic testing of the 'PVP with OVS bonding on phy ports' scenario
>     > shows a significant performance improvement.
>     > More accurate and intensive testing is required.
>     > 
>     > [1] [PATCH 0/6] netdev-dpdk: Use intermediate queue during packet 
>     > transmission.
>     >     https://mail.openvswitch.org/pipermail/ovs-dev/2017-June/334762.html
>     > 
>     > Ilya Maximets (3):
>     >   dpif-netdev: Output packet batching.
>     >   netdev: Remove unused may_steal.
>     >   netdev: Remove useless cutlen.
>     > 
>     >  lib/dpif-netdev.c     | 81 ++++++++++++++++++++++++++++++++++++---------------
>     >  lib/netdev-bsd.c      |  7 ++---
>     >  lib/netdev-dpdk.c     | 30 +++++++------------
>     >  lib/netdev-dummy.c    |  6 ++--
>     >  lib/netdev-linux.c    |  7 ++---
>     >  lib/netdev-provider.h |  7 ++---
>     >  lib/netdev.c          | 12 +++-----
>     >  lib/netdev.h          |  2 +-
>     >  8 files changed, 83 insertions(+), 69 deletions(-)
>     > 
>     > --
>     > 2.7.4
>     > 
>     > _______________________________________________
>     > dev mailing list
>     > d...@openvswitch.org
>     > https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Hi Darrell and Jan.
Thanks for looking at this. I agree with Darrell that mixing implementations
on two different levels is a bad idea, but as I already wrote in reply to
Bhanuprakash [2], there is no issue with implementing output batching
across more than one rx batch.

[2] https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/334808.html

Look at the incremental below. This is how it may look:

---8<---------------------------------------------------------------->8---
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 38282bd..360e737 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -710,6 +710,12 @@ dp_packet_batch_is_empty(const struct dp_packet_batch *batch)
     return !dp_packet_batch_size(batch);
 }
 
+static inline bool
+dp_packet_batch_is_full(const struct dp_packet_batch *batch)
+{
+    return dp_packet_batch_size(batch) == NETDEV_MAX_BURST;
+}
+
 #define DP_PACKET_BATCH_FOR_EACH(PACKET, BATCH)    \
     for (size_t i = 0; i < dp_packet_batch_size(BATCH); i++)  \
         if (PACKET = BATCH->packets[i], true)
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index e49f665..181dcb8 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -84,6 +84,9 @@ VLOG_DEFINE_THIS_MODULE(dpif_netdev);
 #define MAX_RECIRC_DEPTH 5
 DEFINE_STATIC_PER_THREAD_DATA(uint32_t, recirc_depth, 0)
 
+/* Use instant packet send by default. */
+#define DEFAULT_OUTPUT_MAX_LATENCY 0
+
 /* Configuration parameters. */
 enum { MAX_FLOWS = 65536 };     /* Maximum number of flows in flow table. */
 enum { MAX_METERS = 65536 };    /* Maximum number of meters. */
@@ -262,6 +265,9 @@ struct dp_netdev {
     struct ovs_mutex meter_locks[N_METER_LOCKS];
     struct dp_meter *meters[MAX_METERS]; /* Meter bands. */
 
+    /* The time that a packet can wait in output batch for sending. */
+    atomic_uint32_t output_max_latency;
+
     /* Probability of EMC insertions is a factor of 'emc_insert_min'.*/
     OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_uint32_t emc_insert_min;
 
@@ -494,6 +500,7 @@ struct tx_port {
     int qid;
     long long last_used;
     struct hmap_node node;
+    long long output_time;
     struct dp_packet_batch output_pkts;
 };
 
@@ -660,7 +667,7 @@ static void dp_netdev_del_rxq_from_pmd(struct dp_netdev_pmd_thread *pmd,
                                        struct rxq_poll *poll)
     OVS_REQUIRES(pmd->port_mutex);
 static void dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *,
-                                               long long now);
+                                               long long now, bool force);
 static void reconfigure_datapath(struct dp_netdev *dp)
     OVS_REQUIRES(dp->port_mutex);
 static bool dp_netdev_pmd_try_ref(struct dp_netdev_pmd_thread *pmd);
@@ -1182,6 +1189,7 @@ create_dp_netdev(const char *name, const struct dpif_class *class,
 
     conntrack_init(&dp->conntrack);
 
+    atomic_init(&dp->output_max_latency, DEFAULT_OUTPUT_MAX_LATENCY);
     atomic_init(&dp->emc_insert_min, DEFAULT_EM_FLOW_INSERT_MIN);
 
     cmap_init(&dp->poll_threads);
@@ -2848,7 +2856,7 @@ dpif_netdev_execute(struct dpif *dpif, struct dpif_execute *execute)
     dp_packet_batch_init_packet(&pp, execute->packet);
     dp_netdev_execute_actions(pmd, &pp, false, execute->flow,
                               execute->actions, execute->actions_len, now);
-    dp_netdev_pmd_flush_output_packets(pmd, now);
+    dp_netdev_pmd_flush_output_packets(pmd, now, true);
 
     if (pmd->core_id == NON_PMD_CORE_ID) {
         ovs_mutex_unlock(&dp->non_pmd_mutex);
@@ -2897,6 +2905,16 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
         smap_get_ullong(other_config, "emc-insert-inv-prob",
                         DEFAULT_EM_FLOW_INSERT_INV_PROB);
     uint32_t insert_min, cur_min;
+    uint32_t output_max_latency, cur_max_latency;
+
+    output_max_latency = smap_get_int(other_config, "output-max-latency",
+                                      DEFAULT_OUTPUT_MAX_LATENCY);
+    atomic_read_relaxed(&dp->output_max_latency, &cur_max_latency);
+    if (output_max_latency != cur_max_latency) {
+        atomic_store_relaxed(&dp->output_max_latency, output_max_latency);
+        VLOG_INFO("Output maximum latency set to %"PRIu32" ms",
+                  output_max_latency);
+    }
 
     if (!nullable_string_is_equal(dp->pmd_cmask, cmask)) {
         free(dp->pmd_cmask);
@@ -3085,26 +3103,34 @@ cycles_count_end(struct dp_netdev_pmd_thread *pmd,
 }
 
 static void
+dp_netdev_pmd_flush_output_on_port(struct dp_netdev_pmd_thread *pmd,
+                                   struct tx_port *p, long long now)
+{
+    int tx_qid;
+    bool dynamic_txqs;
+
+    dynamic_txqs = p->port->dynamic_txqs;
+    if (dynamic_txqs) {
+        tx_qid = dpif_netdev_xps_get_tx_qid(pmd, p, now);
+    } else {
+        tx_qid = pmd->static_tx_qid;
+    }
+
+    netdev_send(p->port->netdev, tx_qid, &p->output_pkts,
+                dynamic_txqs);
+    dp_packet_batch_init(&p->output_pkts);
+}
+
+static void
 dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *pmd,
-                                   long long now)
+                                   long long now, bool force)
 {
     struct tx_port *p;
 
     HMAP_FOR_EACH (p, node, &pmd->send_port_cache) {
-        if (!dp_packet_batch_is_empty(&p->output_pkts)) {
-            int tx_qid;
-            bool dynamic_txqs;
-
-            dynamic_txqs = p->port->dynamic_txqs;
-            if (dynamic_txqs) {
-                tx_qid = dpif_netdev_xps_get_tx_qid(pmd, p, now);
-            } else {
-                tx_qid = pmd->static_tx_qid;
-            }
-
-            netdev_send(p->port->netdev, tx_qid, &p->output_pkts,
-                        dynamic_txqs);
-            dp_packet_batch_init(&p->output_pkts);
+        if (!dp_packet_batch_is_empty(&p->output_pkts)
+            && (force || p->output_time <= now)) {
+            dp_netdev_pmd_flush_output_on_port(pmd, p, now);
         }
     }
 }
@@ -3128,7 +3154,7 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
 
         cycles_count_start(pmd);
         dp_netdev_input(pmd, &batch, port_no, now);
-        dp_netdev_pmd_flush_output_packets(pmd, now);
+        dp_netdev_pmd_flush_output_packets(pmd, now, false);
         cycles_count_end(pmd, PMD_CYCLES_PROCESSING);
     } else if (error != EAGAIN && error != EOPNOTSUPP) {
         static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
@@ -3663,6 +3689,8 @@ pmd_free_cached_ports(struct dp_netdev_pmd_thread *pmd)
 {
     struct tx_port *tx_port_cached;
 
+    /* Flush all the queued packets. */
+    dp_netdev_pmd_flush_output_packets(pmd, 0, true);
     /* Free all used tx queue ids. */
     dpif_netdev_xps_revalidate_pmd(pmd, 0, true);
 
@@ -4388,6 +4416,7 @@ dp_netdev_add_port_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
 
     tx->port = port;
     tx->qid = -1;
+    tx->output_time = 0LL;
     dp_packet_batch_init(&tx->output_pkts);
 
     hmap_insert(&pmd->tx_ports, &tx->node, hash_port_no(tx->port->port_no));
@@ -5054,8 +5083,18 @@ dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
             }
             dp_packet_batch_apply_cutlen(packets_);
 
+            if (dp_packet_batch_is_empty(&p->output_pkts)) {
+                uint32_t cur_max_latency;
+
+                atomic_read_relaxed(&dp->output_max_latency, &cur_max_latency);
+                p->output_time = now + cur_max_latency;
+            }
+
             DP_PACKET_BATCH_FOR_EACH (packet, packets_) {
                 dp_packet_batch_add(&p->output_pkts, packet);
+                if (OVS_UNLIKELY(dp_packet_batch_is_full(&p->output_pkts))) {
+                    dp_netdev_pmd_flush_output_on_port(pmd, p, now);
+                }
             }
             return;
         }
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index abfe397..34ee908 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -344,6 +344,21 @@
         </p>
       </column>
 
+      <column name="other_config" key="output-max-latency"
+              type='{"type": "integer", "minInteger": 0, "maxInteger": 1000}'>
+        <p>
+          Specifies the time in milliseconds that a packet may wait in the
+          output batch before sending, i.e. the amount of time a packet may
+          spend in an intermediate output queue before being sent to the
+          netdev. This option can be used to balance throughput against
+          latency: lower values decrease latency, while higher values may
+          improve throughput.
+        </p>
+        <p>
+          Defaults to 0, i.e. instant packet sending (latency optimized).
+        </p>
+      </column>
+
       <column name="other_config" key="n-handler-threads"
               type='{"type": "integer", "minInteger": 1}'>
         <p>

---8<---------------------------------------------------------------->8---

I can easily change the millisecond granularity to microseconds if you wish.
For the final version, part of this incremental will be squashed into the
first patch.
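For reference, once the incremental is applied the new knob would be set like
any other datapath option (the value 50 below is only an example, not a
recommendation):

```shell
# Example only: allow packets to wait up to 50 ms in the per-port
# output batch (assumes the incremental above is applied).
ovs-vsctl set Open_vSwitch . other_config:output-max-latency=50

# Return to the default, i.e. instant sending (latency optimized).
ovs-vsctl set Open_vSwitch . other_config:output-max-latency=0
```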

One difference from the netdev layer solution is that we batch only packets
received by the current thread, without mixing packets from different threads.
I think that is an advantage of this solution, because we will never have
performance issues connected to flushing from a non-local thread.
(See the issue description in my reply to Bhanuprakash [2].)

Best regards, Ilya Maximets.