On Wed, Aug 16, 2017 at 10:31 AM, Darrell Ball <[email protected]> wrote:

> Something happened to your email - it is mostly blank lines; also inserted
> b/w lines belonging to same paragraph ?
>

It looks like the problem is related to the receiving email client and
possibly some special formatting. This looks fine in gmail.

> I have a few clarifications about the other lines.
>
> -----Original Message-----
> From: Jan Scheurich <[email protected]>
> Date: Wednesday, August 16, 2017 at 9:23 AM
> To: "Fischetti, Antonio" <[email protected]>, Darrell Ball <[email protected]>, "[email protected]" <[email protected]>
> Subject: RE: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert for recirc packets
>
>     Hi,
>
>     I agree that in the event of EMC overload it is beneficial to reduce
>     the number of EMC insertions and lookups as they just generate overhead
>     and degrade overall throughput. At the same time we want to keep as much
>     of the EMC acceleration as possible for a fraction of traffic that can
>     benefit from EMC most.
>
>     For EMC insertion we have already done this earlier by introducing
>     probabilistic EMC insertion, which greatly reduces the costly effect of
>     EMC thrashing. But we didn't touch the lookup part. How should we select
>     the packets (or rather packet datapath traversals) for which to perform
>     lookup?
>
>     There are several proposals in the air: Only do it for the first pass,
>     not for recirculated packets, only do it for RSS hash values below a
>     (dynamic) threshold, possibly others.
>
>     For EMC insertion we consciously settled on a random selection as the
>     datapath has no a priori insight into which flows are better candidates
>     than others and big flows that benefit most have a higher chance of
>     getting cached.
>
>     Is there a reason to assume that a deterministic selection on some
>     non-random criteria like the recirculation count will on average (over
>     deployments and applications) give a better performance than a random
>     selection?
>
>     I don't believe so. For example, the number of "EMC flows" in each
>     pass through the datapath can differ hugely: 1 GRE tunnel flow in first
>     pass (from phy port), 100K tenant flows after tunnel decapsulation. Or
>     100K tenant flows in first pass (from VM) but 1 flow after NSH
>     encapsulation in second pass.
>
>     I believe a random selection with dynamically adapted probability is
>     the best we can do without a priori knowledge about the traffic patterns
>     and pipeline organization.
>
>     The RSS hash threshold method looks like the only pseudo-random
>     criterion that we can use that produces a consistent result for every
>     packet of a flow and does not require more information. Of course
>     elephant flows with an unlucky hash value might never get to use the
>     EMC, but that risk we have with any stateless selection scheme.
>
> [Darrell] It is probably something I know by another name, but JTBC, can
> you define the “RSS hash threshold method” ?
>
>     The new thing required will be the dynamic adjustment of lookup
>     probability to the EMC fill level and/or hit ratio.
>
> [Darrell] Did you mean insertion probability rather than lookup
> probability ?
>
>     Any ideas for that? I guess we'd need a scheme that periodically
>     increases the probability again to probe for changed traffic patterns.
>
>     Once we have that I think the same dynamic probability could also be
>     used for probabilistic EMC insertion.
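
One possible reading of the "RSS hash threshold method" discussed above, as
a minimal sketch rather than anything taken from the OVS tree: an EMC lookup
is attempted only when the packet's 32-bit RSS hash is at or below a
threshold, so every packet of a flow gets the same decision. The threshold is
then adapted once per interval from the observed hit ratio and is raised
again when things look healthy, to probe for changed traffic. The names
`emc_lookup_gate`, `emc_gate_allows` and `emc_gate_adapt`, the 50% hit-ratio
cutoff and the 1/8 step are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-PMD state; not an existing OVS structure. */
struct emc_lookup_gate {
    uint32_t threshold;   /* Lookup is attempted when rss_hash <= threshold. */
};

/* Stateless per-packet decision.  All packets of a flow share one RSS hash,
 * so a flow is either always or never looked up for a given threshold. */
static inline bool
emc_gate_allows(const struct emc_lookup_gate *gate, uint32_t rss_hash)
{
    return rss_hash <= gate->threshold;
}

/* Called once per measurement interval with the EMC counters of that
 * interval.  A hit ratio below 50% (assumed cutoff) halves the threshold;
 * otherwise the threshold grows by 1/8 of the hash range (assumed step),
 * saturating at UINT32_MAX, so that more flows are probed again. */
static inline void
emc_gate_adapt(struct emc_lookup_gate *gate, uint64_t hits, uint64_t lookups)
{
    const uint32_t step = UINT32_MAX / 8;

    assert(hits <= lookups);
    if (lookups > 0 && hits * 2 < lookups) {
        gate->threshold /= 2;
    } else if (gate->threshold > UINT32_MAX - step) {
        gate->threshold = UINT32_MAX;
    } else {
        gate->threshold += step;
    }
}
```

With the threshold at UINT32_MAX every packet is looked up, which matches the
behaviour when the EMC is not under pressure. Whether halving and additive
probing are the right dynamics is an open question in this thread; the sketch
only shows where such a policy would plug in.
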
>
>     BR, Jan
>
> > > -----Original Message-----
> > > From: [email protected] [mailto:ovs-dev-[email protected]] On Behalf Of Fischetti, Antonio
> > > Sent: Wednesday, 16 August, 2017 14:42
> > > To: Darrell Ball <[email protected]>; [email protected]
> > > Subject: Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert for recirc packets
> > >
> > > > -----Original Message-----
> > > > From: Darrell Ball [mailto:[email protected]]
> > > > Sent: Wednesday, August 16, 2017 9:09 AM
> > > > To: Fischetti, Antonio <[email protected]>; [email protected]
> > > > Subject: Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert for recirc packets
> > > >
> > > > -----Original Message-----
> > > > From: "Fischetti, Antonio" <[email protected]>
> > > > Date: Tuesday, August 15, 2017 at 6:55 AM
> > > > To: Darrell Ball <[email protected]>, "[email protected]" <[email protected]>
> > > > Subject: RE: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert for recirc packets
> > > >
> > > > > -----Original Message-----
> > > > > From: Darrell Ball [mailto:[email protected]]
> > > > > Sent: Monday, August 14, 2017 7:27 AM
> > > > > To: Fischetti, Antonio <[email protected]>; [email protected]
> > > > > Subject: Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert for recirc packets
> > > > >
> > > > > -----Original Message-----
> > > > > From: <[email protected]> on behalf of "[email protected]" <[email protected]>
> > > > > Date: Friday, August 11, 2017 at 8:52 AM
> > > > > To: "[email protected]" <[email protected]>
> > > > > Subject: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert for recirc packets
> > > > >
> > > > >     When OVS is configured as a firewall, with thousands of active
> > > > >     concurrent connections, the EMC gets quickly saturated and may
> > > > >     come under heavy thrashing for the reason that original and
> > > > >     recirculated packets keep overwriting the existing active EMC
> > > > >     entries due to its limited size (8k).
> > > > >
> > > > > The recirculated packet could have been modified, in which case,
> > > > > maybe we still want to do the emc lookup/insert ?
> > > >
> > > > [Antonio]
> > > > IMPO I'd say we should still skip emc anyway, because the purpose is
> > > > to mitigate thrashing when emc is full. So any recirculated packet
> > > > should be classified at the dpcls/ofproto layers.
> > > > I don't know if I'm missing something from your question?
> > > >
> > > > We can expect that a recirc pkt that has been modified - similarly to
> > > > all other recirculated pkts - could result in a miss when emc is full.
> > > > Later we should do an emc insertion that is likely to overwrite some
> > > > active entry. And recursively, this new insertion itself could be
> > > > overwritten - due to the shortage of locations - even before it is hit
> > > > again. This proposal is to mitigate the thrashing with the criteria of
> > > > reserving emc usage to original packets only.
> > > > So a limited resource like emc hopefully could be used more
> > > > efficiently, especially when there is more than 1 recirculation.
> > > > I guess that adding an exception for modified recirc pkts could also
> > > > drop the throughput a bit, as we should add another if statement
> > > > inside emc_processing.
> > > >
> > > > [Darrell]
> > > > I can drop the edited packet case as my concern was really more
> > > > general.
> > > > The concern is that recirculated packets should still be forwarded
> > > > quickly if possible and using emc should help that. The first time
> > > > through, emc is used for the packet and then the second time through,
> > > > emc is not used, so it is slower. But, possibly the argument could be
> > > > made that since it is recirculated, it is already slower, in which
> > > > case, maybe a penalty for recirculated packets is reasonable.
> > >
> > > [Antonio]
> > > Agree. Other than that, in case of an emc congestion - eg a firewall
> > > with say 6,000 connections - with a lot of overwrites, the effect could
> > > be that a lot of lookups will fail and the new insertions are just
> > > overwriting active flows. This keeps the lookup failure rate high, and
> > > the continuous overwrites for insertions become an overhead. So in this
> > > case there's a penalty both for the original (ie the 1st time through)
> > > and for the recirculated packets.
> > > With this approach we are considering that with 6,000 flows we would
> > > need at least 12,000 entries with 1 recirculation. So one strategy to
> > > reduce thrashing could be to restrict emc usage to original packets
> > > only. The counterpart is that recirculated packets are slower, but the
> > > overall effect should be a benefit.
> > >
> > > > Instead of having a simple 50% black and white cutoff, maybe a
> > > > penalty to the insertion probability could be used ?
> > >
> > > [Antonio]
> > > Yes, at the beginning I was considering this solution. I then preferred
> > > the current one because it allows not only to skip insertions but also
> > > to skip lookups, especially when RSS hash must be computed in software.
> > >
> > > The check of the threshold - as this is happening inside
> > > emc_processing - is done with an '&' operation so as to use as few cpu
> > > cycles as possible.
> > >
> > > > >     This thrashing causes the EMC to be less efficient than the dpcls
> > > > >     in terms of lookups and insertions.
> > > > >
> > > > >     This patch allows to use the EMC efficiently by allowing only
> > > > >     the 'original' packets to hit EMC.
> > > > >     All recirculated packets are
> > > > >     sent to the classifier directly.
> > > > >     An empirical threshold EMC_RECIRCT_NO_INSERT_THRESHOLD - of 50% -
> > > > >     for EMC occupancy is set to trigger this logic. By doing so when
> > > > >     EMC utilization exceeds EMC_RECIRCT_NO_INSERT_THRESHOLD:
> > > > >      - EMC Insertions are allowed just for original packets.
> > > > >        EMC insertion and look up are skipped for recirculated packets.
> > > > >      - Recirculated packets are sent to the classifier.
> > > > >
> > > > >     This patch is based on patch
> > > > >     "dpif-netdev: add EMC entry count and %full figure to pmd-stats-show" at:
> > > > >     https://urldefense.proofpoint.com/v2/url?u=https-3A__mail.openvswitch.org_pipermail_ovs-2Ddev_2017-2DJanuary_327570.html&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih-uZnsw&m=NHY06RD-Bcweizxd86m6hcsLPKpe7a4WVSyh9aNZQlo&s=-PhWyltJ71UipVzd1D0H0I9k4uSTLdCJ_zanXxHd7fo&e=
> > > > >
> > > > >     CC: Jan Scheurich <[email protected]>
> > > > >     Signed-off-by: Antonio Fischetti <[email protected]>
> > > > >     Signed-off-by: Bhanuprakash Bodireddy <[email protected]>
> > > > >     Co-authored-by: Bhanuprakash Bodireddy <[email protected]>
> > > > >     ---
> > > > >     Connection Tracker testbench set up with
> > > > >
> > > > >      table=0, priority=1 actions=drop
> > > > >      table=0, priority=10,arp actions=NORMAL
> > > > >      table=0, priority=100,ct_state=-trk,ip actions=ct(table=1)
> > > > >      table=1, ct_state=+new+trk,ip,in_port=1 actions=ct(commit),output:2
> > > > >      table=1, ct_state=+est+trk,ip,in_port=1 actions=output:2
> > > > >      table=1, ct_state=+new+trk,ip,in_port=2 actions=drop
> > > > >      table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1
> > > > >
> > > > >     2 PMDs, 3 Tx queues.
> > > > >
> > > > >     I measured packet Rx rate (regardless of packet loss). Bidirectional
> > > > >     test with 64B UDP packets.
> > > > >     Each row is a test with a different number of traffic streams. The
> > > > >     traffic generator is set so that each stream establishes one UDP
> > > > >     connection.
> > > > >     Mpps columns reports the Rx rates on the 2 sides.
> > > > >
> > > > >     I set up the generator to loop on the dest IP addr on one side,
> > > > >     and loop instead on the source IP addr on the other side.
> > > > >
> > > > >     For example to generate 10 different flows, I was sending to phy port #1
> > > > >      UDP, IPsrc:10.10.10.10, IPdest: 20.20.20.[20-29], PortSrc: 63, PortDest: 63
> > > > >
> > > > >     Instead to phy port #2 (source and dest IPs are now swapped):
> > > > >      UDP, IPsrc: 20.20.20.[20-29], IPdest: 10.10.10.10, PortSrc: 63, PortDest: 63
> > > > >
> > > > >     I saw the following performance improvement.
> > > > >
> > > > >     Original OvS-DPDK means at Commit ID:
> > > > >       6b1babacc3ca0488e07596bf822fe356c9bab646
> > > > >
> > > > >              +----------------------+-----------------------+
> > > > >              | Original OvS-DPDK    | Original OvS-DPDK     |
> > > > >              |                      | + this patch          |
> > > > >     ---------+------------+---------+------------+----------+
> > > > >      Traffic |     Rx     |   EMC   |     Rx     |   EMC    |
> > > > >      Streams |   [Mpps]   | entries |   [Mpps]   | entries  |
> > > > >     ---------+------------+---------+------------+----------+
> > > > >          100 | 2.43, 2.49 |     200 | 2.55, 2.57 |      201 |
> > > > >        1,000 | 2.01, 2.02 |    2007 | 2.12, 2.12 |     2006 |
> > > > >        2,000 | 1.93, 1.95 |    3868 | 1.98, 1.96 |     3884 |
> > > > >        3,000 | 1.87, 1.91 |    5086 | 1.97, 1.97 |     4757 |
> > > > >        4,000 | 1.83, 1.82 |    6173 | 1.94, 1.93 |     5280 |
> > > > >       10,000 | 1.67, 1.69 |    7826 | 1.82, 1.81 |     7090 |
> > > > >       30,000 | 1.57, 1.59 |    8192 | 1.66, 1.67 |     8192 |
> > > > >     ---------+------------+---------+------------+----------+
> > > > >
> > > > >     This test setup implies 1 recirculation on each received packet.
> > > > >     We didn't check this patch in a test scenario where more than 1
> > > > >     recirculation is occurring per packet.
> > > > >     ---
> > > > >      lib/dpif-netdev.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++----
> > > > >      1 file changed, 61 insertions(+), 4 deletions(-)
> > > > >
> > > > >     diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > > > >     index bea1c3f..8f6b96b 100644
> > > > >     --- a/lib/dpif-netdev.c
> > > > >     +++ b/lib/dpif-netdev.c
> > > > >     @@ -4663,6 +4663,9 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
> > > > >          packet_batch_per_flow_update(batch, pkt, mf);
> > > > >      }
> > > > >
> > > > >     +/* Threshold to skip EMC for recirculated packets. */
> > > > >     +#define EMC_RECIRCT_NO_INSERT_THRESHOLD 0xFFFFF000
> > > > >     +
> > > > >      /* Try to process all ('cnt') the 'packets' using only the exact match cache
> > > > >       * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
> > > > >       * miniflow is copied into 'keys' and the packet pointer is moved at the
> > > > >     @@ -4714,8 +4717,36 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
> > > > >              key->len = 0; /* Not computed yet. */
> > > > >              key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
> > > > >
> > > > >     -        /* If EMC is disabled skip emc_lookup */
> > > > >     -        flow = (cur_min == 0) ? NULL: emc_lookup(flow_cache, key);
> > > > >     +        /*
> > > > >     +         * EMC lookup is skipped when one or both of the following
> > > > >     +         * two cases occurs:
> > > > >     +         *
> > > > >     +         *   - EMC is disabled. This is detected from cur_min.
> > > > >     +         *
> > > > >     +         *   - The EMC occupancy exceeds EMC_RECIRCT_NO_INSERT_THRESHOLD and
> > > > >     +         *     the packet to be classified is being recirculated. When this
> > > > >     +         *     happens also EMC insertions are skipped for recirculated
> > > > >     +         *     packets. So that EMC is used just to store entries which
> > > > >     +         *     are hit from the 'original' packets. This way the EMC
> > > > >     +         *     thrashing is mitigated with a benefit on performance.
> > > > >     +         */
> > > > >     +        if (OVS_LIKELY(cur_min)) {
> > > > >     +            if (!md_is_valid) {
> > > > >     +                flow = emc_lookup(flow_cache, key);
> > > > >     +            } else {
> > > > >     +                /* Recirculated packet. */
> > > > >     +                if (flow_cache->n_entries & EMC_RECIRCT_NO_INSERT_THRESHOLD) {
> > > > >     +                    /* EMC occupancy is over the threshold. We skip EMC
> > > > >     +                     * lookup for recirculated packets. */
> > > > >     +                    flow = NULL;
> > > > >     +                } else {
> > > > >     +                    flow = emc_lookup(flow_cache, key);
> > > > >     +                }
> > > > >     +            }
> > > > >     +        } else {
> > > > >     +            flow = NULL;
> > > > >     +        }
> > > > >     +
> > > > >              if (OVS_LIKELY(flow)) {
> > > > >                  dp_netdev_queue_batches(packet, flow, &key->mf, batches,
> > > > >                                          n_batches);
> > > > >     @@ -4800,7 +4831,20 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
> > > > >                                                   add_actions->size);
> > > > >              }
> > > > >              ovs_mutex_unlock(&pmd->flow_mutex);
> > > > >     -        emc_probabilistic_insert(pmd, key, netdev_flow);
> > > > >     +        /* EMC insertion can be skipped by a probabilistic criteria or
> > > > >     +         * - in case of recirculated packets - depending on the number of
> > > > >     +         * EMC entries. */
> > > > >     +        if (!packet->md.recirc_id) {
> > > > >     +            emc_probabilistic_insert(pmd, key, netdev_flow);
> > > > >     +        } else {
> > > > >     +            /* Recirculated packets. When EMC occupancy goes over
> > > > >     +             * a threshold we avoid inserting new entries. */
> > > > >     +            if (!(pmd->flow_cache.n_entries &
> > > > >     +                  EMC_RECIRCT_NO_INSERT_THRESHOLD)) {
> > > > >     +                /* Still under the threshold. */
> > > > >     +                emc_probabilistic_insert(pmd, key, netdev_flow);
> > > > >     +            }
> > > > >     +        }
> > > > >          }
> > > > >      }
> > > > >
> > > > >     @@ -4893,7 +4937,20 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
> > > > >
> > > > >              flow = dp_netdev_flow_cast(rules[i]);
> > > > >
> > > > >     -        emc_probabilistic_insert(pmd, &keys[i], flow);
> > > > >     +        /* EMC insertion can be skipped by a probabilistic criteria or
> > > > >     +         * - in case of recirculated packets - depending on the number of
> > > > >     +         * EMC entries. */
> > > > >     +        if (!packet->md.recirc_id) {
> > > > >     +            emc_probabilistic_insert(pmd, &keys[i], flow);
> > > > >     +        } else {
> > > > >     +            /* Recirculated packets. When EMC occupancy goes over
> > > > >     +             * a threshold we avoid inserting new entries. */
> > > > >     +            if (!(pmd->flow_cache.n_entries &
> > > > >     +                  EMC_RECIRCT_NO_INSERT_THRESHOLD)) {
> > > > >     +                /* Still under the threshold. */
> > > > >     +                emc_probabilistic_insert(pmd, &keys[i], flow);
> > > > >     +            }
> > > > >     +        }
> > > > >              dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches,
> > > > >                                      n_batches);
> > > > >          }
> > > > >
> > > > >     --
> > > > >     2.4.11
> > > > >
> > > > >     _______________________________________________
> > > > >     dev mailing list
> > > > >     [email protected]
> > > > >     https://urldefense.proofpoint.com/v2/url?u=https-3A__mail.openvswitch.org_mailman_listinfo_ovs-2Ddev&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih-uZnsw&m=NHY06RD-Bcweizxd86m6hcsLPKpe7a4WVSyh9aNZQlo&s=-xSW7voYnxrudlh_WPXXsKJ1n1o680-3ZCuwj33q0H8&e=
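
On the `n_entries & EMC_RECIRCT_NO_INSERT_THRESHOLD` test in the quoted
patch: with the 8k EMC mentioned in the thread, the mask 0xFFFFF000 is
non-zero exactly when the entry count is 4096 or more, i.e. at or above 50%
occupancy, which is why a single AND is enough. The sketch below shows how
such a mask follows from a power-of-two threshold. `occupancy_mask`,
`emc_over_threshold` and `EMC_ENTRIES_ASSUMED` are illustrative names, not
OVS identifiers, and the 8192-entry size is taken from the "8k" figure quoted
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed EMC size, from the "8k" figure in the thread. */
#define EMC_ENTRIES_ASSUMED 8192u

/* For a power-of-two 'threshold', returns the mask whose AND with a count
 * is non-zero exactly when count >= threshold. */
static inline uint32_t
occupancy_mask(uint32_t threshold)
{
    assert(threshold != 0 && (threshold & (threshold - 1)) == 0);
    return ~(threshold - 1);
}

/* Branch-cheap occupancy test in the same form as the one in the patch. */
static inline bool
emc_over_threshold(uint32_t n_entries, uint32_t mask)
{
    return (n_entries & mask) != 0;
}
```

One consequence of this form is that only power-of-two thresholds (50%, 25%,
...) can be expressed; any other cutoff would need an ordinary comparison.
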
