On Fri, Dec 08, 2017 at 10:03:06AM +0000, Vishal Deep Ajmera wrote:
> On Wed, Dec 06, 2017 at 11:57:48AM +0000, Vishal Deep Ajmera wrote:
> > Hi Ben,> From: Ben Pfaff [mailto:b...@ovn.org]
> > Sent: Tuesday, December 05, 2017 12:52 AM
> > To: Vishal Deep Ajmera <vishal.deep.ajm...@ericsson.com>
> > Cc: d...@openvswitch.org
> > Subject: Re: [ovs-dev] [PATCH] ofproto-dpif-xlate: Incorrect handling 
> > of errors in group action processing
> > 
> > On Mon, Dec 04, 2017 at 08:47:38AM +0000, Vishal Deep Ajmera wrote:
> > > As per OpenFlow v1.3 specification, when an action list contains a 
> > > group action a copy of the packet is passed to the group for 
> > > processing by the group. This means that if there is an error 
> > > encountered during group processing, only the copy of packet should 
> > > be dropped, but subsequent actions in the action list should be executed 
> > > on the original packet.
> > > 
> > > Additionally, if the group type is "ALL", each action bucket of the 
> > > group should process a copy of the packet. If there is an error 
> > > while processing one bucket other buckets should still be processed.
> > > 
> > > Example 1:
> > > table=0,in_port=tap0 actions=output:tap1,group:10,output:tap2
> > > 
> > > Even if any error is encountered while processing the group action, 
> > > the packet should still be forwarded to ports tap1 and tap2.
> > > 
> > > Example 2:
> > > group_id=1,type=all,bucket=actions=output:tap1,bucket=actions=encap(
> > > et
> > > h)
> > > 
> > > Even if processing the action in the second bucket fails because the 
> > > packet already has an Ethernet header, the other copy of the packet 
> > > should still be processed by the first bucket and output to port tap1.
> > > 
> > > Currently the error handling in OVS does not comply with those rules. 
> > > When any group bucket execution fails the error is recorded in the 
> > > so-called "translation context" which is global for the processing 
> > > of the original packet. Once an error is recorded, OVS skips 
> > > processing subsequent buckets and installs a drop action in the 
> > > datapath even if parts of the action list were previously processed 
> > > successfully.
> > > 
> > > This patch clears the error flag after any bucket of a group is executed.
> > > This way the error encountered in processing any bucket of the group 
> > > will not impact other actions of action-list or other buckets of the 
> > > group.
> > > 
> > > Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajm...@ericsson.com>
> > > Signed-off-by: Keshav Gupta <keshav.gu...@ericsson.com>
> > 
> > Thank you for thinking about this issue.  Clearly there is something to be 
> > done here.
> > 
> > When I look at the uses of translation errors, I see now that there are a 
> > few categories:
> > 
> >    1. Translation can't start at all: XLATE_NO_RECIRCULATION_CONTEXT,
> >       XLATE_BRIDGE_NOT_FOUND, XLATE_RECIRCULATION_CONFLICT,
> >       XLATE_INVALID_TUNNEL_METADATA.
> > 
> >    2. Translation has run too long or used too much space and must end
> >       to bound memory or CPU time.  XLATE_RECURSION_TOO_DEEP and
> >       XLATE_TOO_MANY_RESUBMITS are in this category for time,
> >       XLATE_STACK_TOO_DEEP for space.
> > 
> >    3. Some step can't be executed: XLATE_TOO_MANY_MPLS_LABELS,
> >       XLATE_UNSUPPORTED_PACKET_TYPE.
> > 
> > I believe that the appropriate handling varies among category.  
> > Category
> > 1 is fundamentally unrecoverable.  For category 2, we must not try to 
> > recover because that would extend CPU time (perhaps exponentially) or it 
> > would cause malfunctions (because a later "pop" would miss out on some data 
> > that wasn't "push"ed).  For category 3, we could recover at an appropriate 
> > point in the way you suggest.
> > =======
> > [Vishal/Keshav] Any packet modification done in group bucket processing is 
> > local to bucket only and original packet is retained (as it was before 
> > entering the group processing) after bucket is executed. OVS takes care of 
> > this by restoring the flow back again in ctx->flow in xlate_group_bucket 
> > ().  Now if there are errors in any action in the bucket then the bucket 
> > processing will halt for the packet. Also any changes to the ctx->flow done 
> > in the bucket will not be committed in ctx->odp_actions. So, if we reset 
> > the ctx->error back to old value (which must be 0) we still will be able to 
> > process further actions in the action-list. 
> > 
> > Is our understanding correct ? Or are we missing some case which can 
> > potentially break because of this change ? We think that 
> > XLATE_RECURSION_TOO_DEEP and XLATE_TOO_MANY_RESUBMITS should be treated 
> > local for group processing. Ideally we should also be restoring the 
> > ctx->resubmits to old value after bucket processing is complete to make 
> > sure one bucket does not disrupt processing (exhaust number of resubmits) 
> > of other buckets or other actions of the action-list. ctx->depth parameter 
> > is already adjusted to original value after bucket processing is done in 
> > xlate_group_bucket ().
> > 
> > We agree that for cases where errors are global in nature and can cause 
> > data path issues should not be masked for e.g. XLATE_STACK_TOO_DEEP. Is 
> > there any more such errors we should not mask ?
> > =======
> 
> I don't think you understand why XLATE_RECURSION_TOO_DEEP and 
> XLATE_TOO_MANY_RESUBMITS exist.  They protect translation from running too 
> long.  If we reset and ignore them across group bucket processing, the 
> maximum runtime is multiplied by the number of buckets that can execute, that 
> is, the maximum runtime increases exponentially.  This is not acceptable.
> 
> [Vishal] After looking at the code a bit deeper, I realize that 
> XLATE_TOO_MANY_RESUBMITS  is to limit the total number of table traversal 
> (backward or forward), thereby limiting the time spent in translation. So 
> masking this error for every bucket will defeat the purpose of having it at 
> the first place.
> 
> But I still not completely understood XLATE_RECURSION_TOO_DEEP. Let me try to 
> explain. Currently in OVS, ctx->depth is incremented before bucket actions 
> are executed and then it is decremented (thus restoring it back to old value) 
> before entering another bucket. This means whenever we jump from one group to 
> another group the depth will increase. Similarly when we jump to lower table 
> from current table, the depth will increase. So essentially it is protecting 
> translation from going too deep in recursion for any given path. However, 
> whenever this limit is reached in any path the complete translation is 
> stopped and drop flow is installed. Currently we treat this error as global 
> limit but does it make sense to penalize only the path which caused this 
> error, and execute other paths (buckets) ? Let me know your opinion for the 
> same. I shall send V2 patch for review covering at-least category-3 errors. 

It's debatable.  I think that too-deep recursion indicates a serious bug
in the flow table that should be fixed, and I'd prefer to call attention
to it this way.
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to