Thanks Darrel for your useful comments, I've tried to replicate the usecases you mentioned, please find inline some details.
Regards, Antonio > -----Original Message----- > From: Darrell Ball [mailto:[email protected]] > Sent: Thursday, June 1, 2017 6:20 PM > To: Fischetti, Antonio <[email protected]>; [email protected] > Subject: Re: [ovs-dev] [PATCH RFC v2] Conntrack: Avoid recirculation for > established connections. > > Comments inline > > On 5/29/17, 8:24 AM, "[email protected] on behalf of > Fischetti, Antonio" <[email protected] on behalf of > [email protected]> wrote: > > Thanks Joe for your feedback and the interesting insights in conntrack > in your earlier communication. > We have added all the details that we considered for this first > implementation. Also, some answers are inline. > > The purpose of this implementation is to avoid recirculation just for > those packets that are part of established connections. > > This shouldn't affect the packet recirculation for actions other than > conntrack. For example in MPLS, after a pop_mpls action the packet will > still be recirculated to follow the usual datapath. > > Most importantly, the CT module isn't by-passed in this > implementation. > > Besides the recirculation, all other action[s] shall be executed as-is > on each packet. > Any new CT change or action set by the controller will be managed as > usual. > > For our testing we set up a simple firewall, below are the flows. > > > Flow Setup > ---------- > table=0, priority=100,ct_state=-trk,ip actions=ct(table=1) > table=0, priority=10,arp actions=NORMAL > table=0, priority=1 actions=drop > table=1, ct_state=+new+trk,ip,in_port=1 actions=ct(commit),output:2 > table=1, ct_state=+new+trk,ip,in_port=2 actions=drop > table=1, ct_state=+est+trk,ip,in_port=1 actions=output:2 > table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1 > > > Basic idea > ---------- > With the above rules, all packets belonging to an established > connection will first hit the flow > "ct_state=-trk,ip actions=ct(table=1)" > > then on recirculation, they will hit > "ct_state=+est+trk,ip,in_port=.. actions=output:X". > > The basic idea is to do the following 2 steps. > 1. Combine the two sets of actions by removing the recirculation. > a) Original actions: > - "ct(zone=N), recirc(id)" [ i.e ct(table=1) ] > - "output:X" > b) Combined Actions after Removing recirculation: > - "ct(zone=N), output:X". > > 2. All the subsequent packets shall hit a flow with the combined > actions. > > > [Darrell] > > 1) This would be constraining on how rules are written such that we can’t > go back to > the beginning of the pipeline, just because a packet is > established. > Meaning we can’t run stateless or stateful rules earlier in > the pipeline. > There are lots of possible combinations. > One case is: > We may run a bunch of firewall checks in table X, later on > (table X+1) run the non-filtered packets through conntrack > and then want to go back to the firewall table. > [Antonio] I've tried to replicate the case you described. I've used the following set of rules. They may have no practical meaning, the purpose is to have - at least for packets from port 1 - that they traverse tables #0 => 1 => 2 and then are sent back to table#0. NXST_FLOW reply (xid=0x4): table=0, priority=100,ct_state=-trk,ip actions=ct(table=1) table=1, ct_state=+new+trk,ip,in_port=1,nw_src=10.10.10.0/24 actions=ct(commit),output:2 table=1, ct_state=+est+trk,ip,in_port=1 actions=mod_nw_src:3.3.3.3,resubmit(,2) table=2, ct_state=+est+trk,tcp,in_port=1 actions=drop table=2, ct_state=+est+trk,udp,in_port=1 actions=resubmit(1,0) table=0, ct_state=+est+trk,udp,in_port=1,nw_src=3.3.3.3 actions=output:2 table=1, ct_state=+new+trk,ip,in_port=2 actions=drop table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1 table=0, priority=10,arp actions=NORMAL For packets from port1: - table#0: untracked packets go through the CT, then are recirculated, ie go to table#1. - table#1: new conn are committed. - table#1: packets of est conn are modified in the IPsrc that becomes 3.3.3.3, then sent to table#2. - table#2: TCP packets are dropped - but I'm sending just UDP traffic. - table#2: UDP packets go back to table#0. - table#0: est conn packets with IPsrc=3.3.3.3 are sent to port2. In this case we still see that the packets of est conn will skip one recirculation. In fact what happens is the following. For packets belonging to est connections - and coming from port1 - the actions mod_nw_src:3.3.3.3,resubmit(,2) become ct(),mod_nw_src:3.3.3.3,resubmit(,2) and the EMC entry that was hit by the original packet will point to the megaflow table=1, ct_state=+est+trk,ip,in_port=1 actions=ct(),mod_nw_src:3.3.3.3,resubmit(,2) So new packets of the same connection will skip one recirculation. For packets from port2 the megaflow table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1 becomes table=1, ct_state=+est+trk,ip,in_port=2 actions=ct(),output:1 and will be pointed by the EMC entry hit by the original packet. So new packets of the same connection will skip one recirculation. Is this usecase somewhat similar to the one you were mentioning? Or maybe you can suggest some other rule setup? > 2) This allows the dataplane to override what the ofproto layer specifies. > This looks like a layering violation. > > I guess this also speaks to the datapath (userspace) specific design here. > > > I don’t think it matters what the performance advantage is, although it is > small in this case. > > > > > Considerations > -------------- > On EMC entries: > All packets hitting a connection in the same direction will have the > same 5-tuple. > They shall all hit the same EMC entry - here referred as 'EMC_ORIG' - > and point to the flow > "ct_state=-trk,ip actions=ct(table=1)", referred from here as > 'FLOW_ORIG'. > > On recirculation the packets shall hit another EMC entry - referred as > 'EMC_RECIRC' - and point to the flow > "ct_state=+est+trk,ip,in_port=.. actions=output:X", > referred from here as 'FLOW_RECIRC'. > > FLOW_ORIG is the flow hit by the ingress packet while FLOW_RECIRC is > the flow hit > by the recirculated packet. > > > Implementation Details > ---------------------- > We thought about different implementations, but for the purpose of > this RFC we stick to the following approach. > > On packet recirculation, a new flow has to be created with a list of > appropriate actions. At this > stage we prepend a new action called 'ct(Skip)' to the original list > of actions. This > new action is used as a temporary place-holder that will be over > written later on. > > Original Actions: "Output:2" ==> New Actions: "ct(Skip), > Output:2" > > For all the packets of an established conn. (i.e marked with +trk,+est > bits), do the 3 steps: > 1. Retrieve Flows from EMC. > - FLOW_ORIG whose actions include: ct(zone=N), recirc(id) > - FLOW_RECIRC whose actions include: ct(Skip), Output:X > > 2. Update the actions of FLOW_RECIRC by overwriting ct(skip) with the > ct action from FLOW_ORIG, ie ct(zone=N). > ct(Skip), Output:X => ct(zone=N), Output:X > > 3. Update the flow pointer in EMC_ORIG entry to 'FLOW_RECIRC' flow. > > In this way, we avoid recirculating the packets belonging to > established connections and > this no ways skips sending the packets to the CT module. > > > Antonio > > > -----Original Message----- > > From: [email protected] [mailto:ovs-dev- > > [email protected]] On Behalf Of [email protected] > > Sent: Thursday, May 25, 2017 5:03 PM > > To: [email protected] > > Subject: [ovs-dev] [PATCH RFC v2] Conntrack: Avoid recirculation for > > established connections. > > > > From: Antonio Fischetti <[email protected]> > > > > With conntrack enabled, packets get recirculated and this impacts > > the performance with thousands of active concurrent connections. > > > > This patch is aimed at avoiding recirculation for packets belonging > to > > established connections in steady state. This is achieved by > > manipulating the megaflows and actions. In this case the actions of > the > > megaflow of recirculated packet is initially updated with new > 'CT_SKIP' > > action and later updated with the actions of the megaflow of the > > original packet(to handle zones). There after the EMC entry > > belonging to the original packet is updated to point to the > 'megaflow' of > > the recirculated packet there by avoiding recirculation. > > > > Signed-off-by: Antonio Fischetti <[email protected]> > > Signed-off-by: Bhanuprakash Bodireddy > <[email protected]> > > Co-authored-by: Bhanuprakash Bodireddy > <[email protected]> > > --- > > - ~10% performance improvement is observed in our testing with UDP > > streams. > > - Limited testing is done with RFC patch and the tests include > hundreds of > > concurrent active TCP connections. > > - This is very early implementation and we are posting here to get > initial > > feedback. > > - This RFC patch is implemented leveraging EMC, but optimization > could be > > extended to dpcls as well to handle higher flows. > > - No VXLAN testing is done with this patch. > > > > datapath/linux/compat/include/linux/openvswitch.h | 1 + > > lib/dpif-netdev.c | 148 > > ++++++++++++++++++++-- > > lib/packets.h | 2 + > > 3 files changed, 143 insertions(+), 8 deletions(-) > > > > diff --git a/datapath/linux/compat/include/linux/openvswitch.h > > b/datapath/linux/compat/include/linux/openvswitch.h > > index d22102e..6dc5674 100644 > > --- a/datapath/linux/compat/include/linux/openvswitch.h > > +++ b/datapath/linux/compat/include/linux/openvswitch.h > > @@ -762,6 +762,7 @@ enum ovs_ct_attr { > > OVS_CT_ATTR_NAT, /* Nested OVS_NAT_ATTR_* */ > > OVS_CT_ATTR_FORCE_COMMIT, /* No argument */ > > OVS_CT_ATTR_EVENTMASK, /* u32 mask of IPCT_* events. */ > > + OVS_CT_ATTR_SKIP, /* No argument, does not invoke CT. > */ > > __OVS_CT_ATTR_MAX > > }; > > > > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c > > index b50164b..fbbb42e 100644 > > --- a/lib/dpif-netdev.c > > +++ b/lib/dpif-netdev.c > > @@ -2101,7 +2101,8 @@ emc_probabilistic_insert(struct > dp_netdev_pmd_thread > > *pmd, > > } > > > > static inline struct dp_netdev_flow * > > -emc_lookup(struct emc_cache *cache, const struct netdev_flow_key > *key) > > +emc_lookup(struct emc_cache *cache, const struct netdev_flow_key > *key, > > + void **pfound_entry) > > { > > struct emc_entry *current_entry; > > > > @@ -2110,11 +2111,13 @@ emc_lookup(struct emc_cache *cache, const > struct > > netdev_flow_key *key) > > && emc_entry_alive(current_entry) > > && netdev_flow_key_equal_mf(¤t_entry->key, &key- > >mf)) { > > > > + *pfound_entry = current_entry; > > /* We found the entry with the 'key->mf' miniflow */ > > return current_entry->flow; > > } > > } > > > > + *pfound_entry = NULL; > > return NULL; > > } > > > > @@ -4583,10 +4586,12 @@ emc_processing(struct dp_netdev_pmd_thread > *pmd, > > key->hash = dpif_netdev_packet_get_rss_hash(packet, &key- > >mf); > > > > /* If EMC is disabled skip emc_lookup */ > > - flow = (cur_min == 0) ? NULL: emc_lookup(flow_cache, key); > > + flow = (cur_min == 0) ? NULL: > > + emc_lookup(flow_cache, key, &packet- > >md.prev_emc_entry); > > if (OVS_LIKELY(flow)) { > > dp_netdev_queue_batches(packet, flow, &key->mf, > batches, > > n_batches); > > + packet->md.prev_flow = flow; > > } else { > > /* Exact match cache missed. Group missed packets > together at > > * the beginning of the 'packets' array. */ > > @@ -4595,6 +4600,7 @@ emc_processing(struct dp_netdev_pmd_thread > *pmd, > > * must be returned to the caller. The next key should > be > > extracted > > * to 'keys[n_missed + 1]'. */ > > key = &keys[++n_missed]; > > + packet->md.prev_flow = NULL; > > } > > } > > > > @@ -4604,6 +4610,9 @@ emc_processing(struct dp_netdev_pmd_thread > *pmd, > > return dp_packet_batch_size(packets_); > > } > > > > +#define CT_PREPEND_ACTION_LEN 0x0C > > +#define CT_PREPEND_ACTION_SPARE_LEN 32 > > +#define CT_TOT_ACTIONS_LEN (CT_PREPEND_ACTION_LEN + > > CT_PREPEND_ACTION_SPARE_LEN) > > static inline void > > handle_packet_upcall(struct dp_netdev_pmd_thread *pmd, > > struct dp_packet *packet, > > @@ -4663,12 +4672,44 @@ handle_packet_upcall(struct > dp_netdev_pmd_thread > > *pmd, > > ovs_mutex_lock(&pmd->flow_mutex); > > netdev_flow = dp_netdev_pmd_lookup_flow(pmd, key, NULL); > > if (OVS_LIKELY(!netdev_flow)) { > > - netdev_flow = dp_netdev_flow_add(pmd, &match, &ufid, > > - add_actions->data, > > - add_actions->size); > > + /* Recirculated packets that belong to established > > connections > > + * are treated differently. A place-holder is > prepended to > > the > > + * list of actions. */ > > + if (!(packet->md.ct_state & CS_INVALID) && > > + (packet->md.ct_state & CS_TRACKED) && > > + (packet->md.ct_state & CS_ESTABLISHED) && > > + (packet->md.recirc_id)) { > > + > > > [Joe] "Just because the value of ct_state in the packet metadata of > this > particular packet happens to have those particular state flags doesn't > mean that the OpenFlow pipeline is enforcing this constraint for all > packets hitting this flow. You would have to ensure that the mask in > the flow is also ensuring that all future packets hitting the new flow > will also have that particular ct_state." > > This part of the code should prepend a ct(Skip) action to the > FLOW_RECIRC only and just for packets that belong to an established > conn. > The FLOW_ORIG shouldn't be altered, because it will be hit by all > ingress > packets. > > > + /* Prepend an OVS_CT_ATTR_SKIP to the list of > actions > > inside > > + * add_actions. It does not really invoke the > ConnTrack > > module, > > + * it's a Ct action that works as a place-holder. > It > > will be > > + * overwritten inside emc_patch_established_conn() > with > > + * the proper ct(zone=X) action. */ > > + struct dp_netdev_actions *new_actions; > > + uint8_t tmp_action[CT_TOT_ACTIONS_LEN] = { > > + 0x0C, 0x00, 0x0C, 0x00, > > + 0x06, 0x00, 0x09, 0x00, > > + 0x00, 0x00, 0x00, 0x00, > > + /* will append here add_actions. */ > > + }; > > + memcpy(&tmp_action[CT_PREPEND_ACTION_LEN], > > + add_actions->data, > > + add_actions->size); > > + new_actions = dp_netdev_actions_create( > > + (const struct nlattr *)tmp_action, > > + CT_PREPEND_ACTION_LEN + add_actions->size); > > [Joe] "Hmm. In some ways I think this would be cleaner to be > implemented in > the xlate code, around compose_conntrack_action().. but that code is > currently pretty much dataplane-oblivious and this suggestion is a > userspace-specific optimization." > > > > + netdev_flow = dp_netdev_flow_add(pmd, &match, > &ufid, > > + new_actions- > >actions, > > + new_actions- > >size); > > + } else { > > + netdev_flow = dp_netdev_flow_add(pmd, &match, > &ufid, > > + add_actions->data, > > + add_actions- > >size); > > + } > > } > > ovs_mutex_unlock(&pmd->flow_mutex); > > - emc_probabilistic_insert(pmd, key, netdev_flow); > > + /* For this RFC, probabilistic emc insertion is disabled. > */ > > + emc_insert(&pmd->flow_cache, key, netdev_flow); > > } > > } > > > > @@ -4761,7 +4802,8 @@ fast_path_processing(struct > dp_netdev_pmd_thread > > *pmd, > > > > flow = dp_netdev_flow_cast(rules[i]); > > > > - emc_probabilistic_insert(pmd, &keys[i], flow); > > + /* For testing purposes the emc_insert is restored here. */ > > + emc_insert(&pmd->flow_cache, &keys[i], flow); > > dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, > > n_batches); > > } > > > > @@ -4998,6 +5040,78 @@ dp_execute_userspace_action(struct > > dp_netdev_pmd_thread *pmd, > > } > > } > > > > +/* Search into EMC to retrieve the entry related to the original > packet > > + * and the entry related to the recirculated packet. > > + * If both EMC entries are alive, then: > > + * - The flow actions of the recirculated packet are updated with > > + * 'ct(zone=N)' as retrieved from the actions of the original > flow. > > + * - The EMC orig entry flow is updated with the flow pointer of > recirc > > entry. > > + */ > > +static inline void > > +emc_patch_established_conn(struct dp_netdev_pmd_thread *pmd, > > + struct dp_packet *packet, long long now OVS_UNUSED) > > +{ > > + struct emc_cache *cache = &pmd->flow_cache; > > + struct netdev_flow_key key; > > + struct emc_entry *orig_entry; /* EMC entry hit by the original > > packet. */ > > + struct emc_entry *recirc_entry; /* EMC entry for recirculated > packet. > > */ > > + uint32_t old_hash; > > + > > + if (!packet->md.prev_emc_entry) { > > + return; > > + } > > + > > + orig_entry = packet->md.prev_emc_entry; > > + if (!emc_entry_alive(orig_entry)) { > > + return; > > + } > > + > > + /* Check that the original EMC entry was not overwritten. */ > > + struct dp_netdev_flow *orig_flow = orig_entry->flow; > > + if (packet->md.prev_flow && (packet->md.prev_flow != > orig_flow)) { > > + return; > > + } > > + > > + /* TODO as we are calling miniflow_extract now, can we avoid to > > invoke > > + * it again when we'll classify this recirculated packet? */ > > + miniflow_extract(packet, &key.mf); > > + key.len = 0; /* Not computed yet. */ > > + old_hash = dp_packet_get_rss_hash(packet); > > + key.hash = dpif_netdev_packet_get_rss_hash(packet, NULL); > > + dp_packet_set_rss_hash(packet, old_hash); > > + > > + EMC_FOR_EACH_POS_WITH_HASH(cache, recirc_entry, key.hash) { > > + if (recirc_entry->key.hash == key.hash > > + && emc_entry_alive(recirc_entry) > > + && netdev_flow_key_equal_mf(&recirc_entry->key, > &key.mf)) { > > + > > + if (orig_entry->flow == recirc_entry->flow) { > > + /* Nothing to do, already patched by a packet of > this > > + * same batch. */ > > + return; > > + } > > + /* We found the entry related to the recirculated > packet. > > + * Retrieve the actions for orig and recirc entries. * > */ > > + struct dp_netdev_actions * orig_actions = > > + dp_netdev_flow_get_actions(orig_entry->flow); > > + struct dp_netdev_actions * recirc_actions = > > + dp_netdev_flow_get_actions(recirc_entry->flow); > > + > > + /* The orig entry action contains a CT action with the > zone > > info. */ > > + struct nlattr *p = &orig_actions->actions[0]; > > + > > + /* Overwrite the CT Skip action of recirc entry with > > ct(zone=N). */ > > + memcpy(recirc_actions->actions, p, p->nla_len); > > + > > + /* Update orig EMC entry. */ > > + orig_entry->flow = recirc_entry->flow; > > + dp_netdev_flow_ref(orig_entry->flow); > > + > > + return; > > + } > > + } > > +} > > + > > static void > > dp_execute_cb(void *aux_, struct dp_packet_batch *packets_, > > const struct nlattr *a, bool may_steal) > > @@ -5024,7 +5138,6 @@ dp_execute_cb(void *aux_, struct > dp_packet_batch > > *packets_, > > } else { > > tx_qid = pmd->static_tx_qid; > > } > > - > > netdev_send(p->port->netdev, tx_qid, packets_, > may_steal, > > dynamic_txqs); > > return; > > [Joe] "Unrelated style change." > > Will do that. > > > @@ -5145,10 +5258,21 @@ dp_execute_cb(void *aux_, struct > dp_packet_batch > > *packets_, > > struct dp_packet *packet; > > DP_PACKET_BATCH_FOR_EACH (packet, packets_) { > > packet->md.recirc_id = nl_attr_get_u32(a); > > + > > + if (!(packet->md.ct_state & CS_INVALID) && > > + (packet->md.ct_state & CS_TRACKED) && > > + (packet->md.ct_state & CS_ESTABLISHED)) { > > + (*depth)++; > > + emc_patch_established_conn(pmd, packet, now); > > + (*depth)--; > > + } > > } > > > > (*depth)++; > > dp_netdev_recirculate(pmd, packets_); > > + DP_PACKET_BATCH_FOR_EACH (packet, packets_) { > > + packet->md.recirc_id = 0; > > + } > > (*depth)--; > > > > return; > > [Joe] "Can you give some more detail about why resetting the recirc_id > is > correct here? Could/should this be applied as an independent change > regardless of the rest of this series?" > > This addition is needed so that in handle_packet_upcall we add the > ct(Skip) > just to the FLOW_RECIRC and not to the FLOW_ORIG. > I guess this could be also applied as an independent change? > > > @@ -5166,6 +5290,7 @@ dp_execute_cb(void *aux_, struct > dp_packet_batch > > *packets_, > > const char *helper = NULL; > > const uint32_t *setmark = NULL; > > const struct ovs_key_ct_labels *setlabel = NULL; > > + bool skip_ct = false; > > > > NL_ATTR_FOR_EACH_UNSAFE (b, left, nl_attr_get(a), > > nl_attr_get_size(a)) { > > @@ -5194,6 +5319,9 @@ dp_execute_cb(void *aux_, struct > dp_packet_batch > > *packets_, > > /* Silently ignored, as userspace datapath does not > > generate > > * netlink events. */ > > break; > > + case OVS_CT_ATTR_SKIP: > > + skip_ct = true; > > + break; > > case OVS_CT_ATTR_NAT: > > case OVS_CT_ATTR_UNSPEC: > > case __OVS_CT_ATTR_MAX: > > @@ -5201,6 +5329,10 @@ dp_execute_cb(void *aux_, struct > dp_packet_batch > > *packets_, > > } > > } > > > > + if (OVS_UNLIKELY(skip_ct)) { > > + break; > > + } > > + > > conntrack_execute(&dp->conntrack, packets_, aux->flow- > >dl_type, > > force, > > commit, zone, setmark, setlabel, helper); > > break; > > diff --git a/lib/packets.h b/lib/packets.h > > index 7dbf6dd..2c46a15 100644 > > --- a/lib/packets.h > > +++ b/lib/packets.h > > @@ -112,6 +112,8 @@ struct pkt_metadata { > > struct flow_tnl tunnel; /* Encapsulating tunnel parameters. > Note > > that > > * if 'ip_dst' == 0, the rest of > the > > fields may > > * be uninitialized. */ > > + void *prev_emc_entry; /* EMC entry that this packet > matched. */ > > + void *prev_flow; /* Flow pointed by the matching EMC > > entry. */ > > }; > > > > static inline void > > -- > > 2.4.11 > > > > _______________________________________________ > > dev mailing list > > [email protected] > > https://urldefense.proofpoint.com/v2/url?u=https- > 3A__mail.openvswitch.org_mailman_listinfo_ovs- > 2Ddev&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih- > uZnsw&m=nVfAqNkwbufuhX-dkxok67-OaTEtfFTDAj- > Z39vSDZE&s=7SujU3lZVLCtG1pTSrZt1CMmmuv4iKnSkprKdkxSXZY&e= > _______________________________________________ > dev mailing list > [email protected] > https://urldefense.proofpoint.com/v2/url?u=https- > 3A__mail.openvswitch.org_mailman_listinfo_ovs- > 2Ddev&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih- > uZnsw&m=nVfAqNkwbufuhX-dkxok67-OaTEtfFTDAj- > Z39vSDZE&s=7SujU3lZVLCtG1pTSrZt1CMmmuv4iKnSkprKdkxSXZY&e= > _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
