Hi Ian, Further responses are inline.
> -----Original Message----- > From: Stokes, Ian <[email protected]> > Sent: Wednesday 16 June 2021 13:47 > To: Ferriter, Cian <[email protected]>; [email protected]; Van > Haaren, Harry <[email protected]>; > [email protected] > Subject: RE: [ovs-dev] [v12 15/16] dpif-netdev: Optimize dp output action. > > > Hi Ian, > > > > Thanks for the review. My responses are inline. > > > > > -----Original Message----- > > > From: Stokes, Ian <[email protected]> > > > Sent: Tuesday 15 June 2021 18:04 > > > To: Ferriter, Cian <[email protected]>; [email protected]; > > Ferriter, Cian <[email protected]>; Van Haaren, Harry > > > <[email protected]> > > > Cc: [email protected] > > > Subject: RE: [ovs-dev] [v12 15/16] dpif-netdev: Optimize dp output action. > > > > > > > From: Harry van Haaren <[email protected]> > > > > > > > > This commit optimizes the output action, by enabling the compiler to > > > > optimize the code better through reducing code complexity. > > > > > > > > The core concept of this optimization is that the array-length checks > > > > have already been performed above the copying code, so can be removed. > > > > Removing of the per-packet length checks allows the compiler to auto- > > vectorize > > > > the stores using SIMD registers. > > > > > > > > Signed-off-by: Harry van Haaren <[email protected]> > > > > > > > > > > Thanks for the patch Cian/Harry. > > > > > > Comments inline. > > > > --- > > > > > > > > v8: Add NEWS entry. > > > > --- > > > > NEWS | 1 + > > > > lib/dpif-netdev.c | 23 ++++++++++++++++++----- > > > > 2 files changed, 19 insertions(+), 5 deletions(-) > > > > > > > > diff --git a/NEWS b/NEWS > > > > index d04dac746..a7ec34af1 100644 > > > > --- a/NEWS > > > > +++ b/NEWS > > > > @@ -17,6 +17,7 @@ Post-v2.15.0 > > > > * Enable the AVX512 DPCLS implementation to use VPOPCNT > > > > instruction > > if > > > > the > > > > CPU supports it. This enhances performance by using the native > > vpopcount > > > > instructions, instead of the emulated version of vpopcount. > > > > + * Optimize dp_netdev_output by enhancing compiler optimization > > potential. > > > > - ovs-ctl: > > > > * New option '--no-record-hostname' to disable hostname > > > > configuration > > > > in ovsdb on startup. > > > > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c > > > > index 04a4f71cb..f46269ae3 100644 > > > > --- a/lib/dpif-netdev.c > > > > +++ b/lib/dpif-netdev.c > > > > @@ -7290,12 +7290,25 @@ dp_execute_output_action(struct > > > > dp_netdev_pmd_thread *pmd, > > > > pmd->n_output_batches++; > > > > } > > > > > > > > - struct dp_packet *packet; > > > > - DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) { > > > > - p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] = > > > > - pmd->ctx.last_rxq; > > > > - dp_packet_batch_add(&p->output_pkts, packet); > > > > + /* The above checks ensure that there is enough space in the output > > batch. > > > > + * Using dp_packet_batch_add() has a branch to check if the batch > > > > is full. > > > > + * This branch reduces the compiler's ability to optimize > > > > efficiently. The > > > > + * below code implements packet movement between batches without > > > > checks, > > > > + * with the required semantics of output batch perhaps containing > > packets. > > > > + */ > > > > > > What is the performance gain recorded with this approach with compiler > > optimizations? > > > > > > > The test was run using the scalar DPIF, 1 flow and EMC on, where the > > before/after cycle cost of the patch is 243 cycles before to 230 cycles per > > packet > > after. Hence this optimization saves 13 cycles per packet in my > > measurements. > > So I think that info is important and should be added to the commit message > and will help inform whether to make the change. > I added this information to the commit message. > Ilya has worked in this area extensively so would be interested to hear his > opinion? > > I don't think this is a blocker to the DPIF work so could possibly be > submitted separately as a patch for master? > Correct, it doesn't block the DPIF work. I'll submit as a separate patch. > Regards > Ian > > > > > > + int batch_size = dp_packet_batch_size(packets_); > > > > + int out_batch_idx = dp_packet_batch_size(&p->output_pkts); > > > > + struct dp_netdev_rxq *rxq = pmd->ctx.last_rxq; > > > > + struct dp_packet_batch *output_batch = &p->output_pkts; > > > > + > > > > + for (int i = 0; i < batch_size; i++) { > > > > + struct dp_packet *packet = packets_->packets[i]; > > > > + p->output_pkts_rxqs[out_batch_idx] = rxq; > > > > + output_batch->packets[out_batch_idx] = packet; > > > > + out_batch_idx++; > > > > } > > > > + output_batch->count += batch_size; > > > > > > So I guess the counter argument here is that the previous approach uses > > > the > > dp_packet standard functions. > > > > > > There has been work over time to ensure that dp_packets are handled in a > > similar way using the dp_packet api. > > > > > > This I think reverses the position. That's why I'd like to get a sense of > > > the > > performance gain as I think it's important to justify breaking > > > from code conformance. > > > > > > Has there been thought to modifying the dp_packet_batch_add ? > > > > > > > We thought about adding an "unsafe" version of dp_packet_batch_add() which > > doesn't check whether the packet batch is full. Due to this optimization > > being > > dependent on other checks in the surrounding code, it isn't appropriate to > > have > > a generic function containing this optimization. > > > > > Regards > > > Ian > > > > + > > > > return true; > > > > } > > > > > > > > -- > > > > 2.31.1 > > > > > > > > _______________________________________________ > > > > dev mailing list > > > > [email protected] > > > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev > > > _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
