Re: [Lsr] BFD aspects

Greg Mirsky Tue, 30 Nov 2021 08:31:48 -0800

Hi Robert,
thank you for your kind words and the discussion. Please find my notes
inline below, tagged GIM2>>.


Regards,
Greg


On Tue, Nov 30, 2021 at 1:11 AM Robert Raszuk <rob...@raszuk.net> wrote:

> Greg,
>
> Thank you so much for your input to this discussion. As you can see it is
> not easy to convince some folks :)
>
> I just want to clarify one thing with respect to using multihop BFD between
> RR and PE. There is nothing about data plane path detection in that
> suggestion. Basically BFD is used here as a better ping. No more, no less.
>
> A few points:
>
> #1 - Yes, if my control plane RR-PE network fails and the normal data
> plane to PE is still up I will have a false positive. Solution: Use more
> than one RR.
>
GIM2>> Now we have to reconcile states reported by RRs. Doable but adds
complexity.
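
A minimal sketch of what such reconciliation could look like, assuming the
policy Robert describes elsewhere in this thread (a PE is treated as down
only when every RR has lost its BFD session to it). The names BfdState and
pe_considered_down are hypothetical and used only for illustration; they
are not part of any proposal or implementation.

```python
from enum import Enum


class BfdState(Enum):
    UP = "up"
    DOWN = "down"


def pe_considered_down(reports: dict[str, BfdState]) -> bool:
    """Reconcile per-RR BFD session states for a single PE.

    `reports` maps an RR name to the state of that RR's multihop BFD
    session to the PE. Assumed policy: declare the PE down only if all
    RRs report DOWN, so that a single failed RR-PE control path does not
    produce a false positive. With no reports at all, do nothing.
    """
    if not reports:
        return False
    return all(state is BfdState.DOWN for state in reports.values())


if __name__ == "__main__":
    # One RR lost its path to the PE, the other still sees it.
    print(pe_considered_down({"rr1": BfdState.DOWN, "rr2": BfdState.UP}))
    # Both RRs lost the PE.
    print(pe_considered_down({"rr1": BfdState.DOWN, "rr2": BfdState.DOWN}))
```

Even this simple rule needs a place to collect the reports and a way to
handle RRs that disagree for a while, which is the added complexity.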

>
> #2 - Yes, whether or not the BFD process on the PE is alive does not
> guarantee service liveness of the PE. That is always true, as BFD does not
> check real service data plane processing anyway.
>
GIM2>> Agreed.

>
> #3 - If my network to the PE fails but RR-PE works fine, the PE will be
> considered alive. Network down detection is not the goal when discussing
> PUA/PULSE.
>
GIM2>> Thank you for the clarification. Though I wonder how such separation
benefits network operation.

>
> Cheers,
> R.
>
>
>
>
>
> On Tue, Nov 30, 2021 at 5:08 AM Greg Mirsky <gregimir...@gmail.com> wrote:
>
>> Hi Aijun,
>> thank you for clarifying your goal. I have missed asking another question:
>>
>> What is the required failure detection time?
>>
>> For example, a 10 ms detection guarantee is required for local
>> protection. And that results in a 3.3 ms interval between the fault
>> detection packets (e.g., CCM or BFD). As I understand it, IGP is likely to
>> rely on single-hop BFD detection. Hence, it takes 10 ms before the PE's
>> neighbor discovers the failure. Only then will the IGP processes start
>> acting. Thus, I don't see how IGP can guarantee anything less than 10 ms.
>> Would you agree?
>>
>> Regards,
>> Greg
>>
>> On Mon, Nov 29, 2021 at 7:38 PM Aijun Wang <wangai...@tsinghua.org.cn>
>> wrote:
>>
>>> Hi, Greg:
>>>
>>>
>>>
>>> I understand that BFD can provide a guaranteed failure detection time,
>>> unlike other protocols whose detection time depends on the size of the
>>> network.
>>>
>>> What we want to emphasize is the balance between deployment/operation
>>> overhead and the efficiency of the proposed solutions.
>>>
>>> As for your question, I think we can still get millisecond failure
>>> detection time via the IGP itself (far faster than the BGP hello timer in
>>> the BGP use case, and also a benefit for tunnel services that have no
>>> hello timer).
>>>
>>> The actual time should certainly be verified later in a simulation
>>> environment or in a real network deployment.
>>>
>>>
>>>
>>> Best Regards
>>>
>>>
>>>
>>> Aijun Wang
>>>
>>> China Telecom
>>>
>>>
>>>
>>> *From:* Greg Mirsky <gregimir...@gmail.com>
>>> *Sent:* Tuesday, November 30, 2021 11:11 AM
>>> *To:* Aijun Wang <wangai...@tsinghua.org.cn>
>>> *Cc:* lsr <lsr@ietf.org>; Gyan Mishra <hayabusa...@gmail.com>; Robert
>>> Raszuk <rob...@raszuk.net>
>>> *Subject:* Re: [Lsr] BFD aspects
>>>
>>>
>>>
>>> Hi Aijun,
>>>
>>> what is the guaranteed failure detection time for the IGP-based solution?
>>>
>>>
>>>
>>> Regards,
>>>
>>> Greg
>>>
>>>
>>>
>>> On Mon, Nov 29, 2021 at 7:07 PM Aijun Wang <wangai...@tsinghua.org.cn>
>>> wrote:
>>>
>>> Hi, Greg:
>>>
>>>
>>>
>>> Even if the BFD auto-configuration extensions are standardized and
>>> implemented, won’t the network be filled with detection packets
>>> instead of user packets?
>>>
>>> With the PUA/PULSE solution, the mentioned LSA is generated only when the
>>> node status changes from “UP” to “DOWN”, whereas BFD packets are sent
>>> continuously while these PEs are active.
>>>
>>> Which one is more efficient?
>>>
>>>
>>>
>>> Certainly, we will consider massive failure situations, even though they
>>> occur only in very rare circumstances.
>>>
>>>
>>>
>>> Best Regards
>>>
>>>
>>>
>>> Aijun Wang
>>>
>>> China Telecom
>>>
>>>
>>>
>>> *From:* Greg Mirsky <gregimir...@gmail.com>
>>> *Sent:* Tuesday, November 30, 2021 10:47 AM
>>> *To:* Aijun Wang <wangai...@tsinghua.org.cn>
>>> *Cc:* lsr <lsr@ietf.org>; Gyan Mishra <hayabusa...@gmail.com>; Robert
>>> Raszuk <rob...@raszuk.net>
>>> *Subject:* Re: [Lsr] BFD aspects
>>>
>>>
>>>
>>> Hi Aijun,
>>>
>>> thank you for confirming that it is not the conclusion one can arrive at
>>> based on my discussion with Robert. Secondly, I wouldn't characterize the
>>> problem you describe as a scaling issue with using multi-hop BFD to
>>> monitor path continuity in the underlay network. In my opinion, it is an
>>> operational overhead that can be addressed by an intelligent management
>>> plane or a few extensions in the control plane that sets up the overlay.
>>> Since the management plane is usually a proprietary solution, I invite
>>> anyone interested to work on BFD auto-configuration extensions in the
>>> control plane. I would much appreciate references to the use cases that
>>> can benefit from such extensions.
>>>
>>>
>>>
>>> Regards,
>>>
>>> Greg
>>>
>>>
>>>
>>> On Mon, Nov 29, 2021 at 6:26 PM Aijun Wang <wangai...@tsinghua.org.cn>
>>> wrote:
>>>
>>> Hi, Greg:
>>>
>>>
>>>
>>> Firstly, regardless of which method is used for the multihop BFD
>>> approach, there is certainly configuration overhead if you imagine there
>>> are 10,000 PEs, as Tony has often raised as an example.
>>>
>>> Wouldn’t you have to configure each pair of them to detect the PE-PE
>>> connection?
>>>
>>> It is obviously not scalable.
>>>
>>>
>>>
>>>
>>>
>>> Best Regards
>>>
>>>
>>>
>>> Aijun Wang
>>>
>>> China Telecom
>>>
>>>
>>>
>>> *From:* Greg Mirsky <gregimir...@gmail.com>
>>> *Sent:* Tuesday, November 30, 2021 10:18 AM
>>> *To:* Aijun Wang <wangai...@tsinghua.org.cn>
>>> *Cc:* Gyan Mishra <hayabusa...@gmail.com>; Robert Raszuk <
>>> rob...@raszuk.net>; lsr <lsr@ietf.org>
>>> *Subject:* Re: [Lsr] BFD aspects
>>>
>>>
>>>
>>> Hi Aijun,
>>>
>>> could you please elaborate on how you see that this discussion leads to
>>> the "BFD based detection for the mentioned problem is not [...]
>>> scalable(among PEs)" conclusion? I hope that nothing I've said or
>>> suggested led you to this conclusion. Personally, I believe that BFD-based
>>> PE-PE monitoring is the best technical solution. I understand that an
>>> operator may be dissatisfied with the additional configuration of the BFD
>>> sessions. As noted, I believe that can be addressed in the management
>>> plane or by minor extensions in the control plane (BGP or not). If a
>>> particular implementation (or a combination of the implementation and HW)
>>> has a scaling challenge with multi-hop BFD, then that may not be
>>> sufficient technical justification for a somewhat controversial proposal.
>>>
>>>
>>>
>>> Regards,
>>>
>>> Greg
>>>
>>>
>>>
>>> On Mon, Nov 29, 2021 at 5:17 PM Aijun Wang <wangai...@tsinghua.org.cn>
>>> wrote:
>>>
>>> From the discussion, I think we can get the conclusion that BFD based
>>> detection for the mentioned problem is not reliable (between PE/RR) and
>>> scalable(among PEs).
>>>
>>> The same applies to the BGP based solution.
>>>
>>>
>>>
>>> So let’s focus on how to implement it within the IGP. Thanks to Greg for
>>> the analysis.
>>>
>>> And one supplement to Robert’s comments: the RR is not always located
>>> within the same area as the PEs, so it cannot learn immediately that PE
>>> nodes are down when summarization is configured between areas.
>>>
>>>
>>>
>>> Best Regards
>>>
>>>
>>>
>>> Aijun Wang
>>>
>>> China Telecom
>>>
>>>
>>>
>>> *From:* lsr-boun...@ietf.org <lsr-boun...@ietf.org> *On Behalf Of *Gyan
>>> Mishra
>>> *Sent:* Tuesday, November 30, 2021 8:44 AM
>>> *To:* Robert Raszuk <rob...@raszuk.net>
>>> *Cc:* Greg Mirsky <gregimir...@gmail.com>; lsr <lsr@ietf.org>
>>> *Subject:* Re: [Lsr] BFD aspects
>>>
>>>
>>>
>>>
>>>
>>> Robert
>>>
>>>
>>>
>>> On Mon, Nov 29, 2021 at 7:35 PM Robert Raszuk <rob...@raszuk.net> wrote:
>>>
>>> Hi Greg,
>>>
>>>
>>>
>>> If BFD would have autodiscovery built in, that would indeed be the
>>> ultimate solution. Of course folks will worry about scaling and number of
>>> BFD sessions to be run PE-PE.
>>>
>>> GIM>> I sense that it is not "BFD autodiscovery" but an advertisement of
>>> BFD multi-hop system readiness to the particular PE. That, as I think of
>>> it, can be done in a control or management plane.
>>>
>>>
>>>
>>> Agreed.
>>>
>>>
>>>
>>> But if BFD between all PEs were an option, why would RR to PE in the
>>> local area not be a viable solution?
>>>
>>>
>>>
>>> GIM>>Because, in the case of PE-PE, BFD control packets will be
>>> fate-sharing with data packets. But the path between RR and PE might not be
>>> used for carrying data packets at all.
>>>
>>>
>>>
>>> 100%. But that was accounted for, the reason being that you have at least
>>> two RRs in an area. The point of BFD was to detect that the PE went down.
>>>
>>>
>>>
>>> Gyan> What Greg is alluding to is a very good point to consider: in many
>>> operator networks the RR sits in the “control plane” path, which is
>>> separate from the data plane path. So the RR has no knowledge of the E2E
>>> forwarding plane path between the PEs, as it sits outside the forwarding
>>> plane path. That being said, the PE to RR path is disjoint from the PE-PE
>>> path, so from the RR's POV over the PE-RR path it may wrongly think the
>>> PE is up or down, hence the false positive or negative. That would be the
>>> case regardless of how many RRs are deployed.
>>>
>>>
>>>
>>> You are absolutely right that it may report RR disconnect from the
>>> network while PE is up and data plane from remote PEs can reach it. That is
>>> why we have more than one RR.
>>>
>>>
>>>
>>> As far as fate sharing PE-PE BFD with real user data - I think it is not
>>> always the case. But this is completely separate discussion :)
>>>
>>>
>>>
>>> Also please keep in mind that PE going down can be learned by RRs by
>>> listening to the IGP. No BFD needed.
>>>
>>>
>>>
>>> Both would be multihop, both would be subject to all transit failures
>>> etc ...
>>>
>>> GIM>> I think that there's a difference in the impact a path failure has
>>> on the data traffic. Monitoring the PE-PE path in the underlay, using the
>>> same encapsulation as data traffic, is representative of the data
>>> experience. A failure of the PE-RR path, in my understanding, may not be
>>> representative at all. A BFD session between RR and PE may fail while the
>>> PE is absolutely functional from the service PoV.
>>>
>>>
>>>
>>> Please keep in mind that this entire discussion is not about data plane
>>> failure end to end :) Yes, it's pretty sad. This entire debate is about
>>> indicating domain wide that the IGP component on a PE went down.
>>>
>>>
>>>
>>> No one considers data plane liveness or even, as you observed, data plane
>>> encapsulation congruence. Clearly this is not a true OAM discussion.
>>>
>>>
>>>
>>> On the other hand, PE might be disconnected from the service while the
>>> BFD session to RR is in the Up state.
>>>
>>>
>>>
>>> Not likely, if you keep in mind that to trigger any remote action such a
>>> failure would have to happen to all RRs.
>>>
>>>
>>>
>>> Thx a lot,
>>> R.
>>>
>>>
>>>
>>> _______________________________________________
>>> Lsr mailing list
>>> Lsr@ietf.org
>>> https://www.ietf.org/mailman/listinfo/lsr
>>>
>>> --
>>>
>>> <http://www.verizon.com/>
>>>
>>> *Gyan Mishra*
>>>
>>> *Network Solutions Architect *
>>>
>>> *Email gyan.s.mis...@verizon.com <gyan.s.mis...@verizon.com>*
>>>
>>> *M 301 502-1347*
>>>
>>>
>>>
>>>
>>>
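
The numbers quoted in the thread above (a 10 ms detection guarantee
implying a 3.3 ms packet interval, and the 10,000-PE example) can be
checked with a short sketch. It assumes a detect multiplier of 3, which is
a common choice but is not stated in the thread, and it assumes two RRs
that each run one session per PE. All function names are hypothetical.

```python
def tx_interval_ms(detection_time_ms: float, detect_mult: int = 3) -> float:
    """Packet interval needed for a target detection time.

    Assumes detection time = detect multiplier x packet interval
    (the asynchronous-mode BFD relationship); detect_mult=3 is an
    assumption, not a value taken from the thread.
    """
    return detection_time_ms / detect_mult


def full_mesh_sessions(num_pes: int) -> int:
    """Number of PE-PE sessions if every pair of PEs is monitored."""
    return num_pes * (num_pes - 1) // 2


def rr_sessions(num_pes: int, num_rrs: int = 2) -> int:
    """Number of RR-PE sessions if every RR monitors every PE (assumed)."""
    return num_pes * num_rrs


if __name__ == "__main__":
    print(f"{tx_interval_ms(10):.1f} ms")   # 3.3 ms
    print(full_mesh_sessions(10_000))       # 49995000
    print(rr_sessions(10_000))              # 20000
```

In the full-mesh case each individual PE terminates 9,999 sessions, so the
per-node load and the network-wide total are quite different figures.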
