Thanks Sugesh for your comments, pls see replies inline.
-Antonio

> -----Original Message-----
> From: Chandran, Sugesh
> Sent: Friday, June 23, 2017 4:20 PM
> To: Fischetti, Antonio <[email protected]>; [email protected]
> Subject: RE: [ovs-dev] [PATCH RFC v2] Conntrack: Avoid recirculation for
> established connections.
> 
> Hi Antonio,
> 
> Thank you for the patches,
> 
> Please find my comments below.
> 
> 
> 
> Regards
> _Sugesh
> 
> 
> > -----Original Message-----
> > From: [email protected] [mailto:ovs-dev-
> > [email protected]] On Behalf Of Fischetti, Antonio
> > Sent: Tuesday, June 6, 2017 5:15 PM
> > To: Darrell Ball <[email protected]>; [email protected]
> > Subject: Re: [ovs-dev] [PATCH RFC v2] Conntrack: Avoid recirculation for
> > established connections.
> >
> > Thanks Darrel for your useful comments, I've tried to replicate the
> usecases
> > you mentioned, please find inline some details.
> >
> > Regards,
> > Antonio
> >
> > > -----Original Message-----
> > > From: Darrell Ball [mailto:[email protected]]
> > > Sent: Thursday, June 1, 2017 6:20 PM
> > > To: Fischetti, Antonio <[email protected]>;
> > > [email protected]
> > > Subject: Re: [ovs-dev] [PATCH RFC v2] Conntrack: Avoid recirculation
> > > for established connections.
> > >
> > > Comments inline
> > >
> > > On 5/29/17, 8:24 AM, "[email protected] on behalf of
> > > Fischetti, Antonio" <[email protected] on behalf of
> > > [email protected]> wrote:
> > >
> > >     Thanks Joe for your feedback and the interesting insights in
> > > conntrack in your earlier communication.
> > >     We have added all the details that we considered for this first
> > > implementation. Also, some answers are inline.
> > >
> > >     The purpose of this implementation is to avoid recirculation just
> > > for those packets that are part of established connections.
> > >
> > >     This shouldn't affect the packet recirculation for actions other
> > > than conntrack. For example in MPLS, after a pop_mpls action the
> > > packet will still be recirculated to follow the usual datapath.
> > >
> > >     Most importantly, the CT module isn't by-passed in this
> > > implementation.
> > >
> > >     Besides the recirculation, all other action[s] shall be executed
> > > as-is on each packet.
> > >     Any new CT change or action set by the controller will be managed
> > > as usual.
> > >
> > >     For our testing we set up a simple firewall, below are the flows.
> > >
> > >
> > >      Flow Setup
> > >      ----------
> > >     table=0, priority=100,ct_state=-trk,ip actions=ct(table=1)
> > >     table=0, priority=10,arp actions=NORMAL
> > >     table=0, priority=1 actions=drop
> > >     table=1, ct_state=+new+trk,ip,in_port=1
> actions=ct(commit),output:2
> > >     table=1, ct_state=+new+trk,ip,in_port=2 actions=drop
> > >     table=1, ct_state=+est+trk,ip,in_port=1 actions=output:2
> > >     table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1
> > >
> > >
> > >      Basic idea
> > >      ----------
> > >     With the above rules, all packets belonging to an established
> > > connection will first hit the flow
> > >     "ct_state=-trk,ip actions=ct(table=1)"
> > >
> > >     then on recirculation, they will hit
> > >     "ct_state=+est+trk,ip,in_port=.. actions=output:X".
> > >
> > >     The basic idea is to do the following 2 steps.
> > >     1. Combine the two sets of actions by removing the recirculation.
> > >        a) Original actions:
> > >         - "ct(zone=N), recirc(id)" [ i.e ct(table=1) ]
> > >         - "output:X"
> > >        b) Combined Actions after Removing recirculation:
> > >         - "ct(zone=N), output:X".
> > >
> > >     2. All the subsequent packets shall hit a flow with the combined
> > > actions.
> > >
> > >
> > > [Darrell]
> > >
> > > 1) This would be constraining on how rules are written such that we
> > > can’t go back to
> > >               the beginning of the pipeline, just because a packet is
> > > established.
> > >               Meaning we can’t run stateless or stateful rules earlier
> > > in the pipeline.
> > >               There are lots of possible combinations.
> > >               One case is:
> > >               We may run a bunch of firewall checks in table X, later
> > > on (table X+1) run the non-filtered packets through conntrack
> > >               and then want to go back to the firewall table.
> > >
> >
> > [Antonio]
> > I've tried to replicate the case you described. I've used the following
> set of
> > rules. They may have no practical meaning, the purpose is to have - at
> least
> > for packets from port 1 - that they traverse tables #0 => 1 => 2 and
> then are
> > sent back to table#0.
> >
> > NXST_FLOW reply (xid=0x4):
> >  table=0, priority=100,ct_state=-trk,ip actions=ct(table=1)  table=1,
> > ct_state=+new+trk,ip,in_port=1,nw_src=10.10.10.0/24
> > actions=ct(commit),output:2  table=1, ct_state=+est+trk,ip,in_port=1
> > actions=mod_nw_src:3.3.3.3,resubmit(,2)
> >  table=2, ct_state=+est+trk,tcp,in_port=1 actions=drop  table=2,
> > ct_state=+est+trk,udp,in_port=1 actions=resubmit(1,0)  table=0,
> > ct_state=+est+trk,udp,in_port=1,nw_src=3.3.3.3 actions=output:2
> table=1,
> > ct_state=+new+trk,ip,in_port=2 actions=drop  table=1,
> > ct_state=+est+trk,ip,in_port=2 actions=output:1  table=0,
> priority=10,arp
> > actions=NORMAL
> >
> > For packets from port1:
> >  - table#0: untracked packets go through the CT, then are recirculated,
> ie go
> > to table#1.
> >  - table#1: new conn are committed.
> >  - table#1: packets of est conn are modified in the IPsrc that becomes
> 3.3.3.3,
> >             then sent to table#2.
> >  - table#2: TCP packets are dropped - but I'm sending just UDP traffic.
> >  - table#2: UDP packets go back to table#0.
> >  - table#0: est conn packets with IPsrc=3.3.3.3 are sent to port2.
> >
> > In this case we still see that the packets of est conn will skip one
> recirculation.
> >
> > In fact what happens is the following.
> > For packets belonging to est connections - and coming from
> > port1 - the actions
> >    mod_nw_src:3.3.3.3,resubmit(,2)
> > become
> >    ct(),mod_nw_src:3.3.3.3,resubmit(,2)
> >
> > and the EMC entry that was hit by the original packet will point to the
> > megaflow
> >    table=1, ct_state=+est+trk,ip,in_port=1
> > actions=ct(),mod_nw_src:3.3.3.3,resubmit(,2)
> > So new packets of the same connection will skip one recirculation.
> >
> > For packets from port2 the megaflow
> >    table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1 becomes
> >    table=1, ct_state=+est+trk,ip,in_port=2 actions=ct(),output:1 and
> will be
> > pointed by the EMC entry hit by the original packet.
> > So new packets of the same connection will skip one recirculation.
> >
> > Is this usecase somewhat similar to the one you were mentioning?
> > Or maybe you can suggest some other rule setup?
> >
> >
> > > 2) This allows the dataplane to override what the ofproto layer
> specifies.
> > > This looks like a layering violation.
> > >
> > > I guess this also speaks to the datapath (userspace) specific design
> here.
> > >
> > >
> > > I don’t think it matters what the performance advantage is, although
> > > it is small in this case.
> > >
> > >
> > >
> > >
> > >      Considerations
> > >      --------------
> > >     On EMC entries:
> > >     All packets hitting a connection in the same direction will have
> > > the same 5-tuple.
> > >     They shall all hit the same EMC entry - here referred as
> > > 'EMC_ORIG' - and point to the flow
> > >     "ct_state=-trk,ip actions=ct(table=1)", referred from here as
> > > 'FLOW_ORIG'.
> > >
> > >     On recirculation the packets shall hit another EMC entry -
> > > referred as 'EMC_RECIRC' - and point to the flow
> > >     "ct_state=+est+trk,ip,in_port=.. actions=output:X",
> > >     referred from here as 'FLOW_RECIRC'.
> > >
> > >     FLOW_ORIG is the flow hit by the ingress packet while FLOW_RECIRC
> > > is the flow hit
> > >     by the recirculated packet.
> > >
> > >
> > >      Implementation Details
> > >      ----------------------
> > >     We thought about different implementations, but for the purpose of
> > > this RFC we stick to the following approach.
> > >
> > >     On packet recirculation, a new flow has to be created with a list
> > > of appropriate actions. At this
> > >     stage we prepend a new action called 'ct(Skip)' to the original
> > > list of actions. This
> > >     new action is used as a temporary place-holder that will be over
> > > written later on.
> > >
> > >        Original Actions: "Output:2"    ==>    New Actions: "ct(Skip),
> > > Output:2"
> > >
> > >     For all the packets of an established conn. (i.e marked with
> > > +trk,+est bits), do the 3 steps:
> > >     1. Retrieve Flows from EMC.
> > >         - FLOW_ORIG whose actions include:   ct(zone=N), recirc(id)
> > >         - FLOW_RECIRC whose actions include: ct(Skip), Output:X
> > >
> > >     2. Update the actions of FLOW_RECIRC by overwriting ct(skip) with
> > > the ct action from FLOW_ORIG, ie ct(zone=N).
> > >         ct(Skip), Output:X  =>  ct(zone=N), Output:X
> > >
> > >     3. Update the flow pointer in EMC_ORIG entry to 'FLOW_RECIRC'
> flow.
> > >
> > >     In this way, we avoid recirculating the packets belonging to
> > > established connections and
> > >     this no ways skips sending the packets to the CT module.
> > >
> > >
> > >     Antonio
> > >
> > >     > -----Original Message-----
> > >     > From: [email protected] [mailto:ovs-dev-
> > >     > [email protected]] On Behalf Of
> [email protected]
> > >     > Sent: Thursday, May 25, 2017 5:03 PM
> > >     > To: [email protected]
> > >     > Subject: [ovs-dev] [PATCH RFC v2] Conntrack: Avoid recirculation
> for
> > >     > established connections.
> > >     >
> > >     > From: Antonio Fischetti <[email protected]>
> > >     >
> > >     > With conntrack enabled, packets get recirculated and this
> impacts
> > >     > the performance with thousands of active concurrent connections.
> > >     >
> > >     > This patch is aimed at avoiding recirculation for packets
> > > belonging to
> > >     > established connections in steady state. This is achieved by
> > >     > manipulating the megaflows and actions. In this case the actions
> > > of the
> > >     > megaflow of recirculated packet is initially updated with new
> > > 'CT_SKIP'
> > >     > action and later updated with the actions of the megaflow of the
> > >     > original packet(to handle zones). There after the EMC entry
> > >     > belonging to the original packet is updated to point to the
> > > 'megaflow' of
> > >     > the recirculated packet there by avoiding recirculation.
> > >     >
> > >     > Signed-off-by: Antonio Fischetti <[email protected]>
> > >     > Signed-off-by: Bhanuprakash Bodireddy
> > > <[email protected]>
> > >     > Co-authored-by: Bhanuprakash Bodireddy
> > > <[email protected]>
> > >     > ---
> > >     > - ~10% performance improvement is observed in our testing with
> UDP
> > >     > streams.
> > >     > - Limited testing is done with RFC patch and the tests include
> > > hundreds of
> > >     >   concurrent active TCP connections.
> > >     > - This is very early implementation and we are posting here to
> > > get initial
> > >     >   feedback.
> > >     > - This RFC patch is implemented leveraging EMC, but optimization
> > > could be
> > >     >   extended to dpcls as well to handle higher flows.
> > >     > - No VXLAN testing is done with this patch.
> > >     >
> > >     >  datapath/linux/compat/include/linux/openvswitch.h |   1 +
> > >     >  lib/dpif-netdev.c                                 | 148
> > >     > ++++++++++++++++++++--
> > >     >  lib/packets.h                                     |   2 +
> > >     >  3 files changed, 143 insertions(+), 8 deletions(-)
> > >     >
> > >     > diff --git a/datapath/linux/compat/include/linux/openvswitch.h
> > >     > b/datapath/linux/compat/include/linux/openvswitch.h
> > >     > index d22102e..6dc5674 100644
> > >     > --- a/datapath/linux/compat/include/linux/openvswitch.h
> > >     > +++ b/datapath/linux/compat/include/linux/openvswitch.h
> > >     > @@ -762,6 +762,7 @@ enum ovs_ct_attr {
> > >     >     OVS_CT_ATTR_NAT,        /* Nested OVS_NAT_ATTR_* */
> > >     >     OVS_CT_ATTR_FORCE_COMMIT,  /* No argument */
> > >     >     OVS_CT_ATTR_EVENTMASK,  /* u32 mask of IPCT_* events. */
> > >     > +   OVS_CT_ATTR_SKIP,       /* No argument, does not invoke CT.
> > > */
> > >     >     __OVS_CT_ATTR_MAX
> > >     >  };
> > >     >
> > >     > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > >     > index b50164b..fbbb42e 100644
> > >     > --- a/lib/dpif-netdev.c
> > >     > +++ b/lib/dpif-netdev.c
> > >     > @@ -2101,7 +2101,8 @@ emc_probabilistic_insert(struct
> > > dp_netdev_pmd_thread
> > >     > *pmd,
> > >     >  }
> > >     >
> > >     >  static inline struct dp_netdev_flow *
> > >     > -emc_lookup(struct emc_cache *cache, const struct
> > > netdev_flow_key
> > > *key)
> > >     > +emc_lookup(struct emc_cache *cache, const struct
> > > netdev_flow_key *key,
> > >     > +        void **pfound_entry)
> > >     >  {
> > >     >      struct emc_entry *current_entry;
> > >     >
> > >     > @@ -2110,11 +2111,13 @@ emc_lookup(struct emc_cache *cache,
> > > const struct
> > >     > netdev_flow_key *key)
> > >     >              && emc_entry_alive(current_entry)
> > >     >              && netdev_flow_key_equal_mf(&current_entry->key,
> &key-
> > > >mf)) {
> > >     >
> > >     > +            *pfound_entry = current_entry;
> > >     >              /* We found the entry with the 'key->mf' miniflow
> */
> > >     >              return current_entry->flow;
> > >     >          }
> > >     >      }
> > >     >
> > >     > +    *pfound_entry = NULL;
> > >     >      return NULL;
> > >     >  }
> > >     >
> > >     > @@ -4583,10 +4586,12 @@ emc_processing(struct
> > > dp_netdev_pmd_thread *pmd,
> > >     >          key->hash = dpif_netdev_packet_get_rss_hash(packet,
> &key-
> > > >mf);
> > >     >
> > >     >          /* If EMC is disabled skip emc_lookup */
> > >     > -        flow = (cur_min == 0) ? NULL: emc_lookup(flow_cache,
> key);
> > >     > +        flow = (cur_min == 0) ? NULL:
> > >     > +                emc_lookup(flow_cache, key, &packet-
> > > >md.prev_emc_entry);
> > >     >          if (OVS_LIKELY(flow)) {
> > >     >              dp_netdev_queue_batches(packet, flow, &key->mf,
> > > batches,
> > >     >                                      n_batches);
> > >     > +            packet->md.prev_flow = flow;
> > >     >          } else {
> > >     >              /* Exact match cache missed. Group missed packets
> > > together at
> > >     >               * the beginning of the 'packets' array. */
> > >     > @@ -4595,6 +4600,7 @@ emc_processing(struct
> > dp_netdev_pmd_thread
> > > *pmd,
> > >     >               * must be returned to the caller. The next key
> should
> > > be
> > >     > extracted
> > >     >               * to 'keys[n_missed + 1]'. */
> > >     >              key = &keys[++n_missed];
> > >     > +            packet->md.prev_flow = NULL;
> > >     >          }
> > >     >      }
> > >     >
> > >     > @@ -4604,6 +4610,9 @@ emc_processing(struct
> > dp_netdev_pmd_thread
> > > *pmd,
> > >     >      return dp_packet_batch_size(packets_);
> > >     >  }
> > >     >
> > >     > +#define CT_PREPEND_ACTION_LEN 0x0C
> > >     > +#define CT_PREPEND_ACTION_SPARE_LEN 32
> > >     > +#define CT_TOT_ACTIONS_LEN (CT_PREPEND_ACTION_LEN +
> > >     > CT_PREPEND_ACTION_SPARE_LEN)
> > >     >  static inline void
> > >     >  handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
> > >     >                       struct dp_packet *packet,
> > >     > @@ -4663,12 +4672,44 @@ handle_packet_upcall(struct
> > > dp_netdev_pmd_thread
> > >     > *pmd,
> > >     >          ovs_mutex_lock(&pmd->flow_mutex);
> > >     >          netdev_flow = dp_netdev_pmd_lookup_flow(pmd, key,
> NULL);
> > >     >          if (OVS_LIKELY(!netdev_flow)) {
> > >     > -            netdev_flow = dp_netdev_flow_add(pmd, &match,
> &ufid,
> > >     > -                                             add_actions->data,
> > >     > -                                             add_actions-
> >size);
> > >     > +            /* Recirculated packets that belong to established
> > >     > connections
> > >     > +             * are treated differently.  A place-holder is
> > > prepended to
> > >     > the
> > >     > +             * list of actions. */
> > >     > +            if (!(packet->md.ct_state & CS_INVALID) &&
> > >     > +                 (packet->md.ct_state & CS_TRACKED) &&
> > >     > +                 (packet->md.ct_state & CS_ESTABLISHED) &&
> > >     > +                 (packet->md.recirc_id)) {
> > >     > +
> > >
> > >
> > >     [Joe] "Just because the value of ct_state in the packet metadata
> > > of this
> > >     particular packet happens to have those particular state flags
> doesn't
> > >     mean that the OpenFlow pipeline is enforcing this constraint for
> all
> > >     packets hitting this flow. You would have to ensure that the mask
> in
> > >     the flow is also ensuring that all future packets hitting the new
> flow
> > >     will also have that particular ct_state."
> > >
> > >     This part of the code should prepend a ct(Skip) action to the
> > >     FLOW_RECIRC only and just for packets that belong to an
> > > established conn.
> > >     The FLOW_ORIG shouldn't be altered, because it will be hit by all
> > > ingress
> > >     packets.
> > >
> > >     > +                /* Prepend an OVS_CT_ATTR_SKIP to the list of
> > > actions
> > >     > inside
> > >     > +                 * add_actions.  It does not really invoke the
> > > ConnTrack
> > >     > module,
> > >     > +                 * it's a Ct action that works as a place-
> holder.
> > > It
> > >     > will be
> > >     > +                 * overwritten inside
> emc_patch_established_conn()
> > > with
> > >     > +                 * the proper ct(zone=X) action. */
> > >     > +                struct dp_netdev_actions *new_actions;
> > >     > +                uint8_t tmp_action[CT_TOT_ACTIONS_LEN] = {
> > >     > +                        0x0C, 0x00, 0x0C, 0x00,
> > >     > +                        0x06, 0x00, 0x09, 0x00,
> > >     > +                        0x00, 0x00, 0x00, 0x00,
> > >     > +                        /* will append here add_actions. */
> > >     > +                };
> > >     > +                memcpy(&tmp_action[CT_PREPEND_ACTION_LEN],
> > >     > +                        add_actions->data,
> > >     > +                        add_actions->size);
> > >     > +                new_actions = dp_netdev_actions_create(
> > >     > +                        (const struct nlattr *)tmp_action,
> > >     > +                        CT_PREPEND_ACTION_LEN + add_actions-
> >size);
> > >
> > >     [Joe] "Hmm. In some ways I think this would be cleaner to be
> > > implemented in
> > >     the xlate code, around compose_conntrack_action().. but that code
> is
> > >     currently pretty much dataplane-oblivious and this suggestion is a
> > >     userspace-specific optimization."
> > >
> > >
> > >     > +                netdev_flow = dp_netdev_flow_add(pmd, &match,
> > > &ufid,
> > >     > +                                                 new_actions-
> > > >actions,
> > >     > +                                                 new_actions-
> > > >size);
> > >     > +            } else {
> > >     > +                netdev_flow = dp_netdev_flow_add(pmd, &match,
> > > &ufid,
> > >     > +                                                 add_actions-
> >data,
> > >     > +                                                 add_actions-
> > > >size);
> > >     > +            }
> > >     >          }
> > >     >          ovs_mutex_unlock(&pmd->flow_mutex);
> > >     > -        emc_probabilistic_insert(pmd, key, netdev_flow);
> > >     > +        /* For this RFC, probabilistic emc insertion is
> disabled.
> > > */
> > >     > +        emc_insert(&pmd->flow_cache, key, netdev_flow);
> > >     >      }
> > >     >  }
> > >     >
> > >     > @@ -4761,7 +4802,8 @@ fast_path_processing(struct
> > > dp_netdev_pmd_thread
> > >     > *pmd,
> > >     >
> > >     >          flow = dp_netdev_flow_cast(rules[i]);
> > >     >
> > >     > -        emc_probabilistic_insert(pmd, &keys[i], flow);
> > >     > +        /* For testing purposes the emc_insert is restored
> here. */
> > >     > +        emc_insert(&pmd->flow_cache, &keys[i], flow);
> > >     >          dp_netdev_queue_batches(packet, flow, &keys[i].mf,
> batches,
> > >     > n_batches);
> > >     >      }
> > >     >
> > >     > @@ -4998,6 +5040,78 @@ dp_execute_userspace_action(struct
> > >     > dp_netdev_pmd_thread *pmd,
> > >     >      }
> > >     >  }
> > >     >
> > >     > +/* Search into EMC to retrieve the entry related to the
> > > original packet
> > >     > + * and the entry related to the recirculated packet.
> > >     > + * If both EMC entries are alive, then:
> > >     > + *  - The flow actions of the recirculated packet are updated
> with
> > >     > + *    'ct(zone=N)' as retrieved from the actions of the
> original
> > > flow.
> > >     > + *  - The EMC orig entry flow is updated with the flow pointer
> > > of recirc
> > >     > entry.
> > >     > + */
> > >     > +static inline void
> > >     > +emc_patch_established_conn(struct dp_netdev_pmd_thread *pmd,
> > >     > +        struct dp_packet *packet, long long now OVS_UNUSED)
> > >     > +{
> > >     > +    struct emc_cache *cache = &pmd->flow_cache;
> > >     > +    struct netdev_flow_key key;
> > >     > +    struct emc_entry *orig_entry; /* EMC entry hit by the
> original
> > >     > packet. */
> > >     > +    struct emc_entry *recirc_entry; /* EMC entry for
> recirculated
> > > packet.
> > >     > */
> > >     > +    uint32_t old_hash;
> > >     > +
> > >     > +    if (!packet->md.prev_emc_entry) {
> > >     > +        return;
> > >     > +    }
> > >     > +
> > >     > +    orig_entry = packet->md.prev_emc_entry;
> > >     > +    if (!emc_entry_alive(orig_entry)) {
> > >     > +        return;
> > >     > +    }
> > >     > +
> > >     > +    /* Check that the original EMC entry was not overwritten.
> */
> > >     > +    struct dp_netdev_flow *orig_flow = orig_entry->flow;
> > >     > +    if (packet->md.prev_flow && (packet->md.prev_flow !=
> > > orig_flow)) {
> > >     > +       return;
> > >     > +    }
> > >     > +
> > >     > +    /* TODO as we are calling miniflow_extract now, can we
> avoid to
> > >     > invoke
> > >     > +     * it again when we'll classify this recirculated packet?
> */
> > >     > +    miniflow_extract(packet, &key.mf);
> > >     > +    key.len = 0; /* Not computed yet. */
> > >     > +    old_hash = dp_packet_get_rss_hash(packet);
> > >     > +    key.hash = dpif_netdev_packet_get_rss_hash(packet, NULL);
> > >     > +    dp_packet_set_rss_hash(packet, old_hash);
> > >     > +
> > >     > +    EMC_FOR_EACH_POS_WITH_HASH(cache, recirc_entry, key.hash) {
> > >     > +        if (recirc_entry->key.hash == key.hash
> > >     > +            && emc_entry_alive(recirc_entry)
> > >     > +            && netdev_flow_key_equal_mf(&recirc_entry->key,
> > > &key.mf)) {
> > >     > +
> > >     > +            if (orig_entry->flow == recirc_entry->flow) {
> > >     > +                /* Nothing to do, already patched by a packet
> of
> > > this
> > >     > +                 * same batch. */
> > >     > +                return;
> > >     > +            }
> > >     > +            /* We found the entry related to the recirculated
> > > packet.
> > >     > +             * Retrieve the actions for orig and recirc
> entries. *
> > > */
> > >     > +            struct dp_netdev_actions * orig_actions =
> > >     > +                    dp_netdev_flow_get_actions(orig_entry-
> >flow);
> > >     > +            struct dp_netdev_actions * recirc_actions =
> > >     > +                    dp_netdev_flow_get_actions(recirc_entry-
> >flow);
> > >     > +
> > >     > +            /* The orig entry action contains a CT action with
> the
> > > zone
> > >     > info. */
> [Sugesh] is it safe to assume first action would be ct ??

[Antonio] Right, in general it could be something different than ct.
For this RFC I'm assuming a simplified case.


> > >     > +            struct nlattr *p = &orig_actions->actions[0];
> > >     > +
> > >     > +            /* Overwrite the CT Skip action of recirc entry
> with
> > >     > ct(zone=N). */
> > >     > +            memcpy(recirc_actions->actions, p, p->nla_len);
> [Sugesh] What if recirc_action size is smaller than p? shouldn’t we make
> sure the size of skip action match with size of actions present in
> ct_table?

[Antonio] The skip action is just a place-holder to accommodate the list of
ct actions from the orig flow. It should be sufficient that the skip action
is bigger. 
Worth remember that this solution is a quick implementation of the basic idea,
the purpose was to share the approach to better analyze all possible pros/cons
and corner cases.
A better implementation could be for example - instead of tampering the 
actions of the recirc flow - to create a brand new flow which
has the combined actions picked up from the orig and the recirc flow.


> 
> > >     > +
> > >     > +            /* Update orig EMC entry. */
> > >     > +            orig_entry->flow = recirc_entry->flow;
> > >     > +            dp_netdev_flow_ref(orig_entry->flow);
> > >     > +
> > >     > +            return;
> > >     > +        }
> > >     > +    }
> > >     > +}
> > >     > +
> > >     >  static void
> > >     >  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
> > >     >                const struct nlattr *a, bool may_steal)
> > >     > @@ -5024,7 +5138,6 @@ dp_execute_cb(void *aux_, struct
> > > dp_packet_batch
> > >     > *packets_,
> > >     >              } else {
> > >     >                  tx_qid = pmd->static_tx_qid;
> > >     >              }
> > >     > -
> > >     >              netdev_send(p->port->netdev, tx_qid, packets_,
> > > may_steal,
> > >     >                          dynamic_txqs);
> > >     >              return;
> > >
> > >     [Joe] "Unrelated style change."
> > >
> > >     Will do that.
> > >
> > >     > @@ -5145,10 +5258,21 @@ dp_execute_cb(void *aux_, struct
> > > dp_packet_batch
> > >     > *packets_,
> > >     >              struct dp_packet *packet;
> > >     >              DP_PACKET_BATCH_FOR_EACH (packet, packets_) {
> > >     >                  packet->md.recirc_id = nl_attr_get_u32(a);
> > >     > +
> > >     > +                if (!(packet->md.ct_state & CS_INVALID) &&
> > >     > +                     (packet->md.ct_state & CS_TRACKED) &&
> > >     > +                     (packet->md.ct_state & CS_ESTABLISHED)) {
> > >     > +                    (*depth)++;
> > >     > +                    emc_patch_established_conn(pmd, packet,
> now);
> > >     > +                    (*depth)--;
> > >     > +                }
> > >     >              }
> > >     >
> > >     >              (*depth)++;
> > >     >              dp_netdev_recirculate(pmd, packets_);
> > >     > +            DP_PACKET_BATCH_FOR_EACH (packet, packets_) {
> > >     > +                packet->md.recirc_id = 0;
> > >     > +            }
> > >     >              (*depth)--;
> > >     >
> > >     >              return;
> > >
> > >     [Joe] "Can you give some more detail about why resetting the
> > > recirc_id is
> > >     correct here? Could/should this be applied as an independent
> change
> > >     regardless of the rest of this series?"
> > >
> > >     This addition is needed so that in handle_packet_upcall we add the
> > > ct(Skip)
> > >     just to the FLOW_RECIRC and not to the FLOW_ORIG.
> > >     I guess this could be also applied as an independent change?
> > >
> > >     > @@ -5166,6 +5290,7 @@ dp_execute_cb(void *aux_, struct
> > > dp_packet_batch
> > >     > *packets_,
> > >     >          const char *helper = NULL;
> > >     >          const uint32_t *setmark = NULL;
> > >     >          const struct ovs_key_ct_labels *setlabel = NULL;
> > >     > +        bool skip_ct = false;
> > >     >
> > >     >          NL_ATTR_FOR_EACH_UNSAFE (b, left, nl_attr_get(a),
> > >     >                                   nl_attr_get_size(a)) {
> > >     > @@ -5194,6 +5319,9 @@ dp_execute_cb(void *aux_, struct
> > > dp_packet_batch
> > >     > *packets_,
> > >     >                  /* Silently ignored, as userspace datapath does
> not
> > >     > generate
> > >     >                   * netlink events. */
> > >     >                  break;
> > >     > +            case OVS_CT_ATTR_SKIP:
> > >     > +                skip_ct = true;
> > >     > +                break;
> > >     >              case OVS_CT_ATTR_NAT:
> > >     >              case OVS_CT_ATTR_UNSPEC:
> > >     >              case __OVS_CT_ATTR_MAX:
> > >     > @@ -5201,6 +5329,10 @@ dp_execute_cb(void *aux_, struct
> > > dp_packet_batch
> > >     > *packets_,
> > >     >              }
> > >     >          }
> > >     >
> > >     > +        if (OVS_UNLIKELY(skip_ct)) {
> > >     > +            break;
> > >     > +        }
> > >     > +
> > >     >          conntrack_execute(&dp->conntrack, packets_, aux->flow-
> > > >dl_type,
> > >     > force,
> > >     >                            commit, zone, setmark, setlabel,
> helper);
> > >     >          break;
> > >     > diff --git a/lib/packets.h b/lib/packets.h
> > >     > index 7dbf6dd..2c46a15 100644
> > >     > --- a/lib/packets.h
> > >     > +++ b/lib/packets.h
> > >     > @@ -112,6 +112,8 @@ struct pkt_metadata {
> > >     >      struct flow_tnl tunnel;     /* Encapsulating tunnel
> parameters.
> > > Note
> > >     > that
> > >     >                                   * if 'ip_dst' == 0, the rest
> of
> > > the
> > >     > fields may
> > >     >                                   * be uninitialized. */
> [Sugesh] I don’t see the initiation of these fields. Have you verified the
> cost of these fields in pkt_metadata for normal traffic,?. I mean the
> performance impact.?

[Antonio]
For sure this is an overhead for normal traffic.


> > >     > +    void *prev_emc_entry;       /* EMC entry that this packet
> > > matched. */
> > >     > +    void *prev_flow;            /* Flow pointed by the matching
> EMC
> > >     > entry. */
> > >     >  };
> > >     >
> > >     >  static inline void
> > >     > --
> > >     > 2.4.11
> > >     >
> [Sugesh] I cannot apply the patch on latest master due to some conflicts.
> I am not sure , how does the stats get updated when you are combining the
> flows in datapath. Earlier there are two UUID for two flows and now only
> one UUID is being used.
> I don’t have a expertise in conntrack, However what would be the result
> when we have actions like
> 
> table=0, priority=100,ct_state=-trk,ip actions=clone(push_vlan(4),
> trunc(200), output:2), ct(table=1)
> table=1, ct_state=+est+trk,ip,in_port=1 actions=output:2
> 
> 
> Also another example would be
> table=0, priority=100,ct_state=-trk,ip actions=push_vlan(4), ct(table=1)
> ,dec_ttl
> table=1, ct_state=+est+trk,ip,in_port=1 actions=output:2
> 

[Antonio]
In general whenever you have some actions that imply recirculation 'before'
the ct action
   actions=<other recirc here>, ct(zone=N,recirc)
it should work by combining
   <other recirc here>
   + ct(zone=N)         <=== recirc gets removed
   <actions from recirc flow>


> As Joe suggested before, I think the logic of combine will be good to be
> in the ofproto-xlate layer than here. Though its userspace specific, it
> could be the same way how the userspace_tunnel handled there.
> 

[Antonio] yes, that's interesting


> 
> Also another concern is on the actions that does fields modification along
> with connection tracking. Probably it’s a valid use case with NAT. In this
> scenario, if the flow fields are changed by action, how does it possible
> to track back the original flow?
> Or am I missing anything ?
> 
> 
> 
> > >     > _______________________________________________
> > >     > dev mailing list
> > >     > [email protected]
> > >     > https://urldefense.proofpoint.com/v2/url?u=https-
> > > 3A__mail.openvswitch.org_mailman_listinfo_ovs-
> > > 2Ddev&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih-
> > > uZnsw&m=nVfAqNkwbufuhX-dkxok67-OaTEtfFTDAj-
> > > Z39vSDZE&s=7SujU3lZVLCtG1pTSrZt1CMmmuv4iKnSkprKdkxSXZY&e=
> > >     _______________________________________________
> > >     dev mailing list
> > >     [email protected]
> > >     https://urldefense.proofpoint.com/v2/url?u=https-
> > > 3A__mail.openvswitch.org_mailman_listinfo_ovs-
> > > 2Ddev&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih-
> > > uZnsw&m=nVfAqNkwbufuhX-dkxok67-OaTEtfFTDAj-
> > > Z39vSDZE&s=7SujU3lZVLCtG1pTSrZt1CMmmuv4iKnSkprKdkxSXZY&e=
> > >
> >
> > _______________________________________________
> > dev mailing list
> > [email protected]
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to