Re: [GROW] [Idr] I-D Action: draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt

Rob Shakir Mon, 28 Jan 2013 10:49:31 -0800

Hi Bruno,

Thanks for the review of this version of the draft. I've added some feedback 
in-line as [rjs]. My apologies for the delay in responding.


On 8 Jan 2013, at 10:57, [email protected] wrote:

> 1)  Critical error (§3)
>  
> IMHO, the term “critical error” is mixing both technical/protocol 
> considerations (e.g. can’t read the update) and requirements considerations 
> (BGP session state is too degraded and I prefer shutting it down rather than 
> running in a degraded mode) which IMHO is unfortunate and does not help the 
> discussion. I’d much prefer that we distinguish both by defining technical 
> levels of errors and then defining the requirements for each plus the 
> consequences/drawbacks of the decision (whether to keep or shut the session).
> From the protocol standpoint, I would propose the following levels of errors, 
> based on the protocol encoding layers: session, update, attribute.
> - attribute level error: semantic or syntax error in the attribute value or 
> attribute flags
> - session level error: error in the update length / marker. i.e. if skipping 
> the update length I can’t find the marker of the next bgp message.
> - update level error: any other error in the update message
>  
> We can further distinguish if the NLRIs can be parsed or not.

[rjs]: I would observe that there are two ways that we can classify errors 
here -- one based on the impact of the error on the UPDATE, and one based on 
the reaction to that error. The current draft (clearly) takes the latter 
approach for classification and reaction; however, as you say, it could be 
advantageous to classify the significance of the error to determine how 
"broken" the UPDATE is, and then map this to the possible approaches for 
handling the error.

[rjs]: If we were to take this approach -- then we could end up with the 
following mapping of errors:
        - For attribute-level errors, if the affected attribute is not the 
NLRI-carrying attribute, use NLRI-level error handling; otherwise use 
session-level error handling.
        - For session-level errors, only use session-level error handling.
        - For UPDATE-level errors, if the NLRI attribute can be parsed, then 
use error handling targeted to the NLRI, else handle it at a session level.
I am not clear where we would have UPDATE errors that do not fall within either 
the attribute, or session categories - do you have any example to help me 
understand? Also, do you envisage cases where there are session-level errors 
that we would map to any NLRI-level error handling?
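
As a minimal sketch of the mapping above, assuming the three error levels 
proposed in the review: the names ErrorLevel, Handling and choose_handling are 
hypothetical and do not come from the draft, and the real decision may need 
more inputs than shown here.

```python
from enum import Enum


class ErrorLevel(Enum):
    """Error levels as proposed in the review (hypothetical names)."""
    ATTRIBUTE = "attribute"
    UPDATE = "update"
    SESSION = "session"


class Handling(Enum):
    """Scope at which the error is handled (hypothetical names)."""
    NLRI = "nlri-level"        # e.g. treat-as-withdraw for the affected NLRI
    SESSION = "session-level"  # e.g. reset or restart the session


def choose_handling(level, nlri_parseable, nlri_attr_affected=False):
    """Map an error level to a handling scope, following the list above."""
    if level is ErrorLevel.SESSION:
        # Session-level errors only ever get session-level handling.
        return Handling.SESSION
    if level is ErrorLevel.ATTRIBUTE:
        # Only an error in the NLRI-carrying attribute escalates to the session.
        return Handling.SESSION if nlri_attr_affected else Handling.NLRI
    # UPDATE-level error: the outcome depends on whether the NLRI can be parsed.
    return Handling.NLRI if nlri_parseable else Handling.SESSION
```

For example, choose_handling(ErrorLevel.UPDATE, nlri_parseable=False) falls 
back to Handling.SESSION, which is the case the questions above are probing.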

[rjs]: It does sound advantageous to note the caveats of holding the session up 
for each type of error -- I will work to add a paragraph to § 3 that describes 
the motivation for wanting to hold the session up in some cases, and the 
drawbacks of doing so.


>  2)  Business Requirements
> In the current text, I found the requirements a bit too technically oriented. 
> I’d rather add business requirements independent of the current solutions. I 
> would propose:
>  
> In VPN networks, VPNs are supposed to be isolated from each other and from 
> other services (most notably the Internet). Hence, an error on 
> routes/BGP messages related to a VPN SHOULD NOT negatively impact other VPNs. 
> Similarly, an error on routes/BGP messages related to a non-VPN service 
> SHOULD NOT negatively impact the VPN service.
> In Internet networks, ASes are supposed to be Autonomous. Hence an error on 
> routes/BGP messages originated by an AS SHOULD NOT negatively impact 
> destinations originated from other ASes.
>  
> By “negatively impact”, we mean losing reachability for a destination (NLRI), 
> typically by losing all the paths in the Loc-RIB to that destination (NLRI). 
> Note that those paths may be learnt through multiple BGP sessions and hence 
> the requirement spans multiple BGP sessions. The consequence is that if the 
> BGP error is believed to be limited to a single BGP session (e.g. a session 
> level error), then in a network with redundancy, the destination is believed 
> to be still known through another session and hence the session MAY be 
> shut down and all paths learned from that session removed. On the 
> contrary, if the BGP error has a chance to be also met on the redundant 
> paths/sessions, then the BGP session and the routes learned from that session 
> SHOULD be preserved, until the negative consequences are considered too 
> important. When evaluating those consequences, the fact that all redundant 
> paths/sessions may suffer from the same error and hence will inherit the same 
> decision MUST be considered.

[rjs]: I will go through and review this section to try and align it more with 
the service/business requirements for BGP deployments. It strikes me that the 
suggestion above is more related to an additional point that is not clearly 
included in this section around the different requirements for differing 
networks in which BGP is deployed. I would suggest that this be added to the 
latter part of §2, and that the existing text remain. I'm keen that 
we provide some background as to *why* there is motivation for change in terms 
of deployment characteristics, as well as covering the business requirements 
you mention above.

>  
> As an illustration, we typically seek to avoid a PE losing both of its 
> redundant iBGP sessions with its BGP RRs because of a single BGP error. And 
> by “a PE” I really mean all PEs experiencing this condition. Could easily be 
> 10s of PEs, even 100s.
>  
> 3)  Technical requirements
> For session level errors, the BGP session is dead so needs to be 
> shutdown/graceful shutdown/graceful restart. If the update length is set to 
> the number of octets sent to the peer (or vice versa) rather than computed 
> based on the content of the update, there is a chance to 1) limit the number 
> of such session level errors and 2) increase the probability that this error 
> is local to that session and not likely to happen on a redundant/backup 
> session. There is probably a limited part of the BGP code which needs to be 
> hardened to reduce such unrecoverable errors. And if those errors are still 
> frequent, we may further propose technical solutions (e.g. replacing TCP by 
> SCTP which can provide message boundaries, among other things (e.g. some 
> benefits of multi-sessions))
>  
> For attribute & update level error when the NLRI can be parsed, cf 
> draft-error-handling (treat as withdraw).

[rjs]: AIUI, if we added this requirement, then we could say that the total 
UPDATE length should be trusted as the "real" length of the transmitted UPDATE 
(which would be further validated by the subsequent presence of the marker). In 
this case (and I expect we are getting towards draft-ietf-idr-error-handling 
here), do you think that there is a capability required to indicate that 
an implementation has used this method of calculation? Without one, we 
have the ambiguity of whether an implementation used this "trick", and hence it 
is not clear whether we should trust it.
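
To make the length-plus-marker check concrete, here is a minimal sketch that 
splits a received byte stream into BGP messages by trusting the header length 
field and then requiring the 16-octet all-ones marker at the start of the next 
message. The 19-octet header and the 19..4096 length bounds are from RFC 4271; 
the names split_messages and FramingError are hypothetical, and a real 
implementation would do considerably more than this.

```python
import struct

MARKER = b"\xff" * 16   # 16-octet all-ones marker (RFC 4271)
HEADER_LEN = 19         # marker (16) + length (2) + type (1)
MAX_LEN = 4096          # maximum BGP message size in RFC 4271


class FramingError(Exception):
    """Message boundaries are lost: a session-level error (hypothetical name)."""


def split_messages(buf):
    """Return (complete_messages, leftover_bytes) for a received byte stream."""
    messages = []
    offset = 0
    while len(buf) - offset >= HEADER_LEN:
        header = buf[offset:offset + HEADER_LEN]
        if header[:16] != MARKER:
            # The previous length field did not lead us to a valid marker.
            raise FramingError(f"no marker at offset {offset}")
        (length,) = struct.unpack("!H", header[16:18])
        if not HEADER_LEN <= length <= MAX_LEN:
            raise FramingError(f"bad length {length} at offset {offset}")
        if len(buf) - offset < length:
            break  # message not fully received yet
        messages.append(buf[offset:offset + length])
        offset += length
    return messages, buf[offset:]
```

Note that a wrong length in one message is only detected when the following 
header is examined, which is why the marker acts as the validation step.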

> Now let the discussion begin :-) For attribute & update level errors when 
> the NLRI cannot be extracted IMHO there is room for discussion and analysis 
> of the consequences.
>  
> “since the NLRI cannot be extracted, error handling mechanisms must be 
> applied at the per-session level” (§5)
> Well, IMO, this is a choice to be made rather than a “must”.

[rjs]: Do you envisage that this is a requirement in all scenarios, or a 
special case to be able to hold the session up following repeated errors? If 
during normal operation one tries to apply treat-as-withdraw, then this cannot 
be done (safely) unless we can determine to which NLRI it should be applied. 
I'm unclear whether this not being a MUST (although at the moment it's a 
lower-case 'must') really implies that we have a requirement for a solution 
akin to the persistence draft as a "last resort" mechanism?
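
The dependency on the NLRI can be illustrated with a minimal sketch of 
treat-as-withdraw, in which the prefixes carried in the erroneous UPDATE are 
handled as if they had been withdrawn. The function name, the dict-based 
Adj-RIB-In, and the use of None for "NLRI could not be extracted" are all 
assumptions made for illustration only.

```python
from typing import Dict, List, Optional


def treat_as_withdraw(adj_rib_in: Dict[str, object],
                      nlri: Optional[List[str]]) -> bool:
    """Withdraw the prefixes of an erroneous UPDATE from one session's RIB.

    Returns False when the NLRI could not be extracted, in which case nothing
    is changed and the caller has to fall back to some other mechanism.
    """
    if nlri is None:
        # Without the NLRI there is no safe way to know what to withdraw.
        return False
    for prefix in nlri:
        # Remove any path previously learned for this prefix on this session.
        adj_rib_in.pop(prefix, None)
    return True
```

When the function returns False, the choice is between session-level handling 
and some "last resort" behaviour, which is the question raised above.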

[rjs]: I think that this is in line with your later discussion -- essentially, the 
different levels as to how conservative one might want to be are very black and 
white at the current time (within the draft), as it's really whether you have 
these mechanisms "on", or "off". Is your suggestion that we evaluate more 
levels of error handling (i.e., include the "ignore all errors and continue 
operating") within this document, or is it an evaluation between the current 
on/off levels? Extending the draft to cover the "hold up" use case potentially 
expands it outside of BGP error handling that is applicable to most deployments 
of BGP into more special cases in my view. I'd like to understand whether the 
working group feels that this problem space falls within the scope of this 
draft.

Thanks,
r.

> If we were to skip a BGP update:
> For Internet, probably the worst case would be to miss a BGP update with a 
> loop in the AS path and hence create a loop for me and my upstream ASes for 
> the NLRI in the missed update. How probable is this? 0 for iBGP 
> sessions. TBE for eBGP. Then what would be the consequences? Loss of 
> connectivity for the NLRI until the problem is manually solved by an AS 
> between the origin and me, possible forwarding congestion for others. I’m 
> not sure I care too much about losing reachability to the NLRI in a faulty 
> BGP update as most likely, if only one BGP update (out of millions) is 
> faulty, the reason may come from the origin AS playing with a specific bit or 
> attribute and if they chose to play with their update, they should bear the 
> responsibility. This is to be compared with the probability of losing all 
> redundant paths (if the error is seen on redundant paths) and the 
> consequences (PE -possibly all PEs- down).
>  
> For VPN, probably the worst case would be to keep a VPN label previously 
> allocated to VPN 1 and re-allocated to another VPN (VPN breach, cf. 
> http://tools.ietf.org/html/draft-uttaro-idr-bgp-persistence-01#section-8)
> Again, the pros and cons could be discussed (e.g. a possibly one-way partial 
> VPN breach for some time (that basically no one can exploit) vs all VPNs/PEs 
> being down). IMHO, if we believe such an issue could be corrected in 30-60 
> minutes, I would probably favor keeping the session up.
>  
> From the lively discussions, it looks like opinions may vary depending on 
> the AS, people and circumstances. E.g. how failure-independent are my 
> redundant BGP paths? (e.g. do they use different BGP implementations)
> As such, what about defining severity levels for BGP error handling? One 
> may wish to accept only low severity errors while others may be willing to 
> accept high severity errors (including when the NLRI cannot be found), e.g. 
> the network has been down for 30 minutes and, while waiting for the patch, 
> one may want to be able to restore some service at all costs (can’t possibly 
> be worse).
>  
> Again, IMHO it would be good to discuss the drawbacks depending on the 
> situation (iBGP, eBGP; hop by hop routed, tunneled …) in this requirement 
> document to make sure we are all on the same page, we have constructive 
> discussions and SPs enabling revised error handling are fully aware of the 
> consequences.
>  
> 4)  Security consideration
> In §7 “security considerations” I would discuss the fact that current BGP 
> error handling (or a (too) strict one) could be exploited by attackers to 
> create a remote DoS attack.
> Should we also ask for a review by the SIDR WG since “The purpose of the SIDR 
> working group is to reduce vulnerabilities in the inter-domain routing 
> system.” ? ...
>  
> Best regards,
> Bruno
>  
> >-----Original Message-----
> >From: [email protected] [mailto:[email protected]] On Behalf Of Rob
> >Shakir
> >Sent: Thursday, December 27, 2012 7:44 PM
> >To: [email protected]
> >Subject: [Idr] Fwd: [GROW] I-D Action: draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt
> > 
> >Hi IDR!
> > 
> >FYI -- please find an update relating to a new version of 
> >draft-ietf-grow-ops-reqs-for-bgp-error-handling.
> > 
> >Any comments very welcome (to me or grow@).
> > 
> >Seasons greetings!
> >r.
> > 
> >Begin forwarded message:
> > 
> >> From: <[email protected]>
> >> Subject: Re: [GROW] I-D Action: draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt
> >> Date: 27 December 2012 18:41:50 GMT
> >> To: <[email protected]>, <[email protected]>
> >> Cc: [email protected]
> >> 
> >> On 27/12/2012 18:35, "[email protected]" <[email protected]>
> >> wrote:
> >> 
> >>> 
> >>> A New Internet-Draft is available from the on-line Internet-Drafts
> >>> directories.
> >>> This draft is a work item of the Global Routing Operations Working Group
> >>> of the IETF.
> >>> 
> >>>   Title           : Operational Requirements for Enhanced Error Handling
> >>> Behaviour in BGP-4
> >>>   Author(s)       : Rob Shakir
> >>>   Filename        : draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt
> >>>   Pages           : 19
> >>>   Date            : 2012-12-27
> >> 
> >> Hi GROW!
> >> 
> >> This update is a fairly major re-spin of the BGP Error Handling
> >> requirements draft. The technical content should be as per the previous
> >> revisions; however, following the IETF/RtgDir last call comments, I have
> >> made the following changes:
> >> 
> >> * Made the amendments that were discussed at our meeting in Atlanta and
> >> met with no disagreement -- this is essentially renaming the
> >> Critical/Semantic error types to Critical/Non-Critical.
> >> 
> >> * Significant de-duplication within the text including merging the
> >> operational monitoring/toolset discussions into the error handling
> >> sections.
> >> 
> >> * Adoption of RFC 2119 language throughout to clarify the requirements.
> >> 
> >> * Removal of some of the discussion around more detailed justifications
> >> for why particular decisions were made. I think this was useful through
> >> the discussion phase of this draft, but it seems like GROW/IDR have
> >> converged on a relatively stable set of requirements, so I have trimmed
> >> back some of this discussion.
> >> 
> >> I'd really welcome any further comments on this before we re-submit for
> >> publication. To eke these out - Peter/Chris - can you kick off a WGLC for
> >> this draft please? :-)
> >> 
> >> Seasons greetings!
> >> r.
> >> 
> >> _______________________________________________
> >> GROW mailing list
> >> [email protected]
> >> https://www.ietf.org/mailman/listinfo/grow
> > 
> >_______________________________________________
> >Idr mailing list
> >[email protected]
> >https://www.ietf.org/mailman/listinfo/idr

_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow