Thanks a lot Billy, really appreciate your feedback.
My replies inline.

/Antonio

> -----Original Message-----
> From: O Mahony, Billy
> Sent: Friday, June 23, 2017 6:39 PM
> To: Fischetti, Antonio <[email protected]>; [email protected]
> Subject: RE: [ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC lookup/insert
> for recirculated packets.
> 
> Hi Antonio,
> 
> This is a really interesting patch. Comments inline below.
> 
> Thanks,
> /Billy.
> 
> > -----Original Message-----
> > From: [email protected] [mailto:ovs-dev-
> > [email protected]] On Behalf Of [email protected]
> > Sent: Monday, June 19, 2017 11:12 AM
> > To: [email protected]
> > Subject: [ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC lookup/insert
> for
> > recirculated packets.
> >
> > From: Antonio Fischetti <[email protected]>
> >
> > When OVS is configured as a firewall, with thousands of active
> concurrent
> > connections, the EMC gets quicly saturated and may come under heavy
> > thrashing for the reason that original and recirculated packets keep
> overwrite
> > existing active EMC entries due to its limited size(8k).
> >
> > This thrashing causes the EMC to be less efficient than the dcpls in
> terms of
> > lookups and insertions.
> >
> > This patch allows to use the EMC efficiently by allowing only the
> 'original'
> > packets to hit EMC. All recirculated packets are sent to classifier
> directly.
> > An empirical threshold (EMC_FULL_THRESHOLD - of 50%) for EMC occupancy
> > is set to trigger this logic. By doing so when EMC utilization exceeds
> > EMC_FULL_THRESHOLD.
> >  - EMC Insertions are allowed just for original packets. EMC insertion
> >    and look up is skipped for recirculated packets.
> >  - Recirculated packets are sent to classifier.
> >
> > This patch depends on the previous one in this series. It's based on
> patch
> > "dpif-netdev: add EMC entry count and %full figure to pmd-stats-show"
> at:
> > https://mail.openvswitch.org/pipermail/ovs-dev/2017-January/327570.html
> >
> > Signed-off-by: Antonio Fischetti <[email protected]>
> > Signed-off-by: Bhanuprakash Bodireddy
> > <[email protected]>
> > Co-authored-by: Bhanuprakash Bodireddy
> > <[email protected]>
> > ---
> > In our Connection Tracker testbench set up with
> >
> >  table=0, priority=1 actions=drop
> >  table=0, priority=10,arp actions=NORMAL  table=0,
> priority=100,ct_state=-
> > trk,ip actions=ct(table=1)  table=1, ct_state=+new+trk,ip,in_port=1
> > actions=ct(commit),output:2  table=1, ct_state=+est+trk,ip,in_port=1
> > actions=output:2  table=1, ct_state=+new+trk,ip,in_port=2 actions=drop
> > table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1
> >
> > we saw the following performance improvement.
> >
> > Measured packet Rx rate (regardless of packet loss). Bidirectional test
> with
> > 64B UDP packets.
> > Each row is a test with a different number of traffic streams. The
> traffic
> > generator is set so that each stream establishes one UDP connection.
> > Mpps columns reports the Rx rates on the 2 sides.
> >
> >  Traffic |    Orig    |     Orig      |  +changes  |   +changes
> >  Streams |   [Mpps]   | [EMC entries] |   [Mpps]   | [EMC entries]
> > ---------+------------+---------------+------------+---------------
> >      10  |  3.4, 3.4  |      20       |  3.4, 3.4  |      20
> >     100  |  2.6, 2.7  |     200       |  2.6, 2.7  |     201
> >   1,000  |  2.4, 2.4  |    2009       |  2.4, 2.4  |    1994
> >   2,000  |  2.2, 2.2  |    3903       |  2.2, 2.2  |    3900
> >   3,000  |  2.1, 2.1  |    5473       |  2.2, 2.2  |    4798
> >   4,000  |  2.0, 2.0  |    6478       |  2.2, 2.2  |    5663
> >  10,000  |  1.8, 1.9  |    8070       |  2.0, 2.0  |    7347
> > 100,000  |  1.7, 1.7  |    8192       |  1.8, 1.8  |    8192
> >
> 
> [[BO'M]]
> A few questions on the test:
> Are all the pkts rxd being recirculated?

[AF] Yes, I sent UDP packets with the firewall rules above. Every packets 
goes through the CT module, so after that it is recirculated.

> Are there any flows present where the pkts do not require recirculation?

[AF] No. The flow
table=0,priority=100,ct_state=-trk,ip actions=ct(table=1)
implies that any received packet goes through CT and then it is recirculated.

> Was the rxd rss hash calculation offloaded to the NIC?

[AF] Yes.

> For the cases with larger numbers of flows (10K , 100K) did you
> investigate the results when the EMC is simply switched off?

[AF] No, I just focused on this criteria to check if I could really get some
performance improvement. It's likely that with - say 100k or more flows - it's
better to switch off EMC? 

> 
> Say we have 3000 flows (the lowest figure at which we see a positive
> effect) that means 6000 flows are contending for places in the emc.
> Is the effect we see here to do with disabling recirculated packets in
> particular or just reducing contention on the emc in general. 

[AF] I guess the benefit comes from reducing the continuous recursive 
overwrites in EMC.
With this approach
  => less overhead for insertions
  => it's more likely that original packets do find a matching EMC entry. 
Otherwise
whatever we insert into EMC it's likely to be overwritten before a new packet
of that same flow comes in. When we insert entries for recirculated packets for 
sure we're deleting useful entries referring to some active megaflows and this 
will in turn happen again for the next packets recursively.


I know that
> the recirculated pkt hashes require software hashing albeit a small amount
> so they do make a good category of packet to drop from the emc when
> contention is severe.

[AF] Agree, if we skip EMC for recirculated packets
 1. we save some cpu cycles for hash computation 
 2. we mitigate the EMC thrashing, and this is the main purpose of this patch. 

> 
> Once there are too many entries contending  for any cache it's going to
> become a liability as the lookup_cost + (miss_rate * cost_of_miss) grows
> to be greater that the cost_of_miss. And in that case any scheme where a
> very cheap decision can be made to not insert a certain category of
> packets and to also not check the emc for that same category will reduce
> the cache miss rate.

[AF] agree, thanks for explaining this with clear words, I tried to do the
same but I was not so successful :-)

> 
> But I'm not sure that is_recirculated is the best categorization on which
> to make that decision. Mainly because it is not controllable. In this test
> case 50% pkts were recirculated so by ruling these packets out of
> eligibility for the EMC you get a really large reduction in EMC
> contention. However you can never know ahead of time if any packets will
> be recirc'd. You may have a situation where the EMC is totally swamped
> with 200K flows as above but none of them are re-circ'd packets so ruling
> out those packets will give no benefit.

[AF] agree, this approach is limited to conntrack, or any tunneling usecases.

> 
> 
> >  lib/dpif-netdev.c | 46 ++++++++++++++++++++++++++++++++++++++++--
> > ----
> >  1 file changed, 40 insertions(+), 6 deletions(-)
> >
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> fd2ed52..64a3cd4
> > 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -4538,6 +4538,8 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
> >      packet_batch_per_flow_update(batch, pkt, mf);  }
> >
> > +#define EMC_FULL_THRESHOLD 0x0000F000
> > +
> [[BO'M]]
> 
> Use of FULL in #def is a little misleading as it is really indicating the
> threshold after which recirc pkts do not get inserted to the EMC.
> 
> EMC_RECIRCT_NO_INSERT_THRESHOLD maybe?

[AF] you're right, I'll take note of that. Or something like 
EMC_OCCUPANCY_THRESHOLD? Hmm,.. 

> 
> As this #def is used with an & operator not a > operator in the patch it
> needs to be clear that number has the restrictions of
> a) having exactly one bit set so is limited to power of two values.
> b) needs to not greater  than 1u << (EM_FLOW_HASH_SHIFT-1)
> 
> Would suggest that value is based directly off EM_FLOW_HASH_SHIFT
> 
> e.g
> #def EMC_RECIRCT_NO_INSERT_HALF (1u << (EM_FLOW_HASH_SHIFT-1))
> #def EMC_RECIRCT_NO_INSERT_QUARTER (1u << (EM_FLOW_HASH_SHIFT-2))
> #def EMC_RECIRCT_NO_INSERT_EIGHT (1u << (EM_FLOW_HASH_SHIFT-3))
> 
> #def EMC_RECIRCT_NO_INSERT_THRESHOLD  EMC_RECIRCT_NO_INSERT_HALF
> 

[AF] Thanks, that would help also on readability. I used the & operator just for
performance reasons, I'm not quite sure if the > operator would require some 
more 
cpu cycles, I didn't check that.

> 
> >  /* Try to process all ('cnt') the 'packets' using only the exact match
> cache
> >   * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]',
> the
> >   * miniflow is copied into 'keys' and the packet pointer is moved at
> the @@ -
> > 4582,6 +4584,19 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
> >              pkt_metadata_prefetch_init(&packets[i+1]->md);
> >          }
> >
> > +        /*
> > +         * EMC lookup is skipped when one or both of the following
> > +         * two cases occurs:
> > +         *
> > +         *   - EMC is disabled.  This is detected from cur_min.
> > +         *
> > +         *   - The EMC occupancy exceeds EMC_FULL_THRESHOLD and the
> > +         *     packet to be classified is being recirculated.  When
> this
> > +         *     happens also EMC insertions are skipped for recirculated
> > +         *     packets.  So that EMC is used just to store entries
> which
> > +         *     are hit from the 'original' packets.  This way the EMC
> > +         *     thrashing is mitigated with a benefit on performance.
> > +         */
> >          if (!md_is_valid) {
> >              pkt_metadata_init(&packet->md, port_no);
> >              miniflow_extract(packet, &key->mf); @@ -4603,11 +4618,18 @@
> > emc_processing(struct dp_netdev_pmd_thread *pmd,
> >          } else {
> >              /* Recirculated packets. */
> >              miniflow_extract(packet, &key->mf);
> > -            if (OVS_LIKELY(cur_min)) {
> > -                key->hash = dpif_netdev_packet_get_rss_hash(packet,
> &key->mf);
> > -                flow = emc_lookup(flow_cache, key);
> > -            } else {
> > +            if (flow_cache->n_entries & EMC_FULL_THRESHOLD) {
> > +                /* EMC occupancy is over the threshold.  We skip EMC
> > +                 * lookup for recirculated packets. */
> >                  flow = NULL;
> > +            } else {
> > +                if (OVS_LIKELY(cur_min)) {
> > +                    key->hash = dpif_netdev_packet_get_rss_hash(packet,
> > +                                    &key->mf);
> > +                    flow = emc_lookup(flow_cache, key);
> > +                } else {
> > +                    flow = NULL;
> > +                }
> >              }
> >          }
> >          key->len = 0; /* Not computed yet. */ @@ -4695,7 +4717,13 @@
> > handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
> >                                               add_actions->size);
> >          }
> >          ovs_mutex_unlock(&pmd->flow_mutex);
> > -        emc_probabilistic_insert(pmd, key, netdev_flow);
> > +        /* When EMC occupancy goes over a threshold we avoid inserting
> new
> > +         * entries for recirculated packets. */
> > +        if (!packet->md.recirc_id) {
> > +            emc_probabilistic_insert(pmd, key, netdev_flow);
> > +        } else if (!(pmd->flow_cache.n_entries & EMC_FULL_THRESHOLD)) {
> > +            emc_probabilistic_insert(pmd, key, netdev_flow);
> > +        }
> >      }
> >  }
> >
> > @@ -4788,7 +4816,13 @@ fast_path_processing(struct
> > dp_netdev_pmd_thread *pmd,
> >
> >          flow = dp_netdev_flow_cast(rules[i]);
> >
> > -        emc_probabilistic_insert(pmd, &keys[i], flow);
> > +        /* When EMC occupancy goes over a threshold we avoid inserting
> new
> > +         * entries for recirculated packets. */
> > +        if (!packet->md.recirc_id) {
> > +            emc_probabilistic_insert(pmd, &keys[i], flow);
> > +        } else if (!(pmd->flow_cache.n_entries & EMC_FULL_THRESHOLD)) {
> > +            emc_probabilistic_insert(pmd, &keys[i], flow);
> > +        }
> >          dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches,
> > n_batches);
> >      }
> >
> > --
> > 2.4.11
> >
> > _______________________________________________
> > dev mailing list
> > [email protected]
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to