Henk -
Top posting for better readability.
1)Whether we should publicly document what smart implementers have done is not
relevant here (though I appreciate this thread has sparked interest). If folks
want to discuss this please do so in a separate thread.
The only point I am making is that there are implementations which have already
solved the problems you are trying to address by using TCP and have done so w/o
requiring protocol extensions. These implementations will not benefit from your
proposal.
2)Your solution introduces two sub-sessions associated with an adjacency - one
the traditional hello exchange over Layer 2 - the other a "TCP-like" session
used to support the IS-IS Update process (LSP flooding).
Both sub-sessions MUST be in the UP state in order for an adjacency to be
usable. If either of the sub-sessions is down then the adjacency isn't usable
and advertisement of the neighbor MUST be withdrawn. Otherwise we risk
blackholes.
3)Because the two sub-sessions are not guaranteed to share fate, you have to
deal with this issue in a way conceptually similar to what RFC 6213 has done.
This is an essential part of the solution which you must define.
4)It is true BFD does not detect all possible failures - and it is possible
there are still scenarios where BFD session state might not always share fate
with the corresponding address-family data packets. If you have ideas for
improvements here that would certainly be of interest to the IETF community.
But BFD does work quite well and is widely deployed, and its use to keep an IGP
adjacency from coming UP unless the BFD session is also up has definitely solved
real-world problems.
5)The suppression of IS-IS adjacency advertisement until completion of initial
LSPDB sync ("be like OSPF") has been addressed by use of the Suppress
Adjacency (SA) bit as defined in RFC 5306 (IS-IS Restart). Note the SA bit applies
to cold start - not restart (do not be misled by the name of the containing
document).
6)CSNP reliability is a known issue. ISO 10589 addressed this issue by
requiring that SRM bit be set for all LSPs in the local LSPDB on adjacency UP
in addition to sending a CSNP set. As an aside, ISO 10589 did this even though
at the time (circa 1990) the underlying P2P sub-networks were all reliable
(think LAPB, LAPD, X.25). Belt and suspenders.
In cases where both of the neighbors already have (most of) the LSPDB, the loss
of a CSNP results in unnecessary flooding of many LSPs. This cost may be
ameliorated by making the initial CSNP exchange reliable (RFC 5306 includes this
as part of the restart mechanism). But we are dealing w transitory failures -
and TCP is not immune to these. The fact that no IS-IS code is required to
recover from this when TCP is used does not mean that retransmissions come for
free.
There is no free lunch.
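
To make the cost in point 6 concrete, here is a minimal Python sketch of the
"set SRM for everything, let the CSNP clear what the neighbor already has"
behavior on a P2P adjacency. It is an illustration only, not any
implementation's code: the function name, the dict-based LSPDB and the LSP IDs
are made up, and it compares sequence numbers only (no lifetime, checksum, SSN
or PSNP handling).

```python
def lsps_still_flagged(local_db, csnp_entries):
    """Return the LSP IDs whose SRM flag is still set after the initial exchange.

    local_db: dict mapping LSP ID -> sequence number in our LSPDB (hypothetical model).
    csnp_entries: dict of the same shape describing the neighbor's CSNP set,
                  or None if the CSNP set was lost.
    """
    # On adjacency UP, SRM is set for every LSP in the local LSPDB.
    srm = set(local_db)
    if csnp_entries is None:
        # No CSNP arrived, so nothing clears the flags: everything gets flooded.
        return srm
    for lsp_id, seq in csnp_entries.items():
        # The neighbor already holds the same version, so there is no need to send it.
        if local_db.get(lsp_id) == seq:
            srm.discard(lsp_id)
    return srm


local_db = {"R1.00-00": 7, "R2.00-00": 3, "R3.00-00": 12}
neighbor_csnp = {"R1.00-00": 7, "R2.00-00": 3, "R3.00-00": 11}

print(sorted(lsps_still_flagged(local_db, neighbor_csnp)))  # ['R3.00-00']
print(len(lsps_still_flagged(local_db, None)))              # 3
```

With the CSNP delivered, only the one LSP the neighbor has an older copy of is
sent; with the CSNP lost, all three are. Scale that to a real LSPDB and you see
the unnecessary flooding I am describing.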
Les
> -----Original Message-----
> From: Henk Smit <[email protected]>
> Sent: Wednesday, November 07, 2018 2:22 PM
> To: Les Ginsberg (ginsberg) <[email protected]>
> Cc: [email protected]; [email protected]
> Subject: Re: [Lsr] IS-IS over TCP
>
>
>
> Les Ginsberg (ginsberg) wrote 2018-11-07 17:06:
>
> >> The problem that RFC6213 tries to solve is a case where one of the
> >> neighbors is thinking that the other does not support BFD. And thus
> >> the lack of BFD is not used as an indication that something is wrong.
> >> Right ?
> >>
> > [Les:] This is not correct.
> > The key paragraph is in https://tools.ietf.org/html/rfc6213#section-2
> >
> > " The problem with this solution is that it assumes that the
> > transmission and receipt of IS-IS Hellos (IIHs) shares fate with
> > forwarded data packets. This is not a fair assumption to make given
> > that the primary use of BFD is to protect IPv4 (and IPv6) forwarding,
> > and IS-IS does not utilize IPv4 or IPv6 for sending or receiving its
> > hellos."
> >
> > We have seen cases where IPv4/IPv6 data packet delivery has been
> > compromised - but IS-IS PDU delivery was unaffected. This led to the
> > following behavior:
> >
> > 1)IS-IS exchanges hellos and brings adjacency up. Routes using the link
> > are installed
> > 2)BFD session is started and comes up
> > 3)After a time some problem occurs which only impacts IP traffic - BFD
> > session goes down.
> > 4)IS-IS adjacency is brought down due to BFD session down, but IS-IS
> > continues to send hellos. If they are successfully exchanged then
> > IS-IS adjacency is almost immediately restored and we resume
> > installing IP routes using the link even though BFD session never
> > comes up. Data traffic gets dropped.
>
> Thanks for the explanation.
> But that is kinda what I wanted to say.
> BFD is failing (because the IP-path is failing), and IS-IS doesn't
> realize this, because it thinks that BFD isn't being used (because
> "BFD session never comes up").
>
> > The extensions in RFC 6213 allow IS-IS to know when both sides support
> > BFD, which means the sequence changes to:
> >
> > 1)IS-IS exchanges hellos - but adjacency remains in INIT state.
> > 2)BFD session is initiated - IS-IS adjacency remains in INIT state
> > until BFD session comes up.
> >
> > Thus IS-IS never installs routes using the link unless we know IP
> > traffic can be successfully delivered.
> > We can then use BFD both as a requirement to bring the adjacency up
> > AND as fast failure detection.
>
> OK. What do you suggest we do to fix this failure case if we do
> flooding over TCP ?
>
> - We could make BFD mandatory when flooding over TCP ? If the IP-path
> is broken, TCP will fail, but BFD will also fail ?
>
> - We bring the adjacency to UP state, but we don't include it in our own
> LSP immediately. Only after the TCP session has been established, we
> advertise the new adjacency in our LSP. Would that be enough ? It would
> stop routes from being calculated over the new adjacency.
> Maybe wait until the TCP connection has been set up, and a pair of IIHs
> has been exchanged over it ? (So authentication and other stuff can
> be verified for the TCP session).
> Or maybe even wait until IIHs have been sent, and then full sets of CSNPs
> are exchanged in both directions ?
>
> That last suggestion starts approaching the way OSPF does this. If I recall
> correctly, OSPF will only include adjacencies in its type-1 LSA after DDs
> have been exchanged, and the full LSDB has been synchronized. Would you
> want IS-IS to do the same ?
>
> - If the TCP session breaks, do you want to stop including the adjacency
> in the LSP ? This will make things like NSF, process restart and control
> plane failover much harder.
>
> >> What if two routers can exchange IIHs and do proper flooding of
> >> LSPs. But they can not exchange IP packets ? This could happen.
> >> IS-IS does not have a way to deal with this.
> >
> > [Les:] RFC 6213 was written precisely to address this case - and works
> > very well.
>
> The fact that BFD is working does not mean it is 100% sure that all
> IP traffic will work. Failures might depend on protocol number,
> port numbers, packet size, etc. I agree that it is likely that if BFD
> works, all of IP will work. But it's no guarantee.
> Likewise we have to decide how paranoid we want to be: if IIHs
> are exchanged, how sure are we that TCP can exchange LSPs as well ?
>
> Maybe a good compromise would be:
> 1) don't advertise the adjacency in your LSPs until the TCP flooding
> connection has been established. (And maybe IIHs/CSNPs are exchanged).
> 2) after connection is fully up (IIHs and flooding works), use longer
> time-outs to determine whether TCP is still working.
> 3) when the other side closes a TCP connection (by FIN or RST), don't
> stop advertising the adjacency in your LSP immediately. Instead,
> for the next 10 seconds or so, try to re-establish the TCP connection
> first. If re-establishment doesn't work, then the router can stop
> including the adjacency in its LSP.
>
> This would prevent routers advertising new adjacencies that have a problem
> with TCP. But if it works, and suddenly stops working, convergence is
> slower (10 seconds or so). But the protocol has the ability to re-establish
> the TCP session, to make it more flexible.
>
> Would that be acceptable ?
>
> > [Les:] Actually they have. :-) That's why we wrote RFC 6213 - because
> > the problem has been seen in the field.
>
> I was talking about a router forwarding traffic.
> BFD between router A and router B could work, but B (for some reason)
> doesn't forward traffic. That was my extreme example to show that BFD
> doesn't protect all weird failure scenarios possible. The question is:
> how paranoid does a protocol want to be ?
>
> > [Les:] I appreciate your concerns - and it is true that "most of the
> > time" if we cannot receive IP traffic we likely cannot receive IS-IS
> > PDUs . But actual field experience has shown this is not always the
> > case - and when it is not the lack of recognition of the failed state
> > has major impacts.
> >
> > I don't think you can ignore this case just because it will happen
> > infrequently.
>
> We got BFD to protect against that.
> But if I understand correctly, you propose that if the TCP session doesn't
> come up, or gets torn down later, we also bring down the adjacency, even
> when IIHs are still exchanged. That's reasonable, of course. Even if it
> makes things slightly more complicated.
>
> > [Les:] I am not suggesting that this aspect of scaling can be ignored.
> > But I am saying implementations have successfully addressed this with
> > "smarter implementations" and done so w/o requiring protocol
> > extensions.
>
> I personally agree with this position.
> Vendors should be able to distinguish themselves from the competition.
>
> But people working for operators have a different view.
> When this draft got published, we got private email from an operator
> addressing a number of issues (mainly when using unnumbered interfaces).
> His point was: he wanted to make sure that all corner-cases are handled
> identically by all vendors. Or at least in an inter-operable way.
> Therefore he wanted every little detail of implementation explicitly
> mentioned. I guess seeing your backbone go down because of vendor-specific
> enhancements (or lack thereof) will give you a different view on the
> matter.
>
> So I think that if vendors do things that are noticeable on the wire, even
> if they should be compatible with standard behaviour, those things should
> be documented at least. Therefore I thought: "if all vendors are gonna
> implement their own pacing and their own reliability and their own
> high throughput, why not all use the same well-known transport protocol".
> Hence our proposal.
>
> > [Les:] I think the details of the "smart implementations" have not
> > been formally documented because they did not require protocol
> > extensions (i.e., they are interoperable w implementations which do
> > not have equal smarts) and they are largely seen as the value add
> > resulting from significant development/test efforts. Why would I want
> > to give that knowledge away for free when I might use it to gain a
> > competitive advantage?
>
> Router vendors do not.
> Operators probably do want everything written down.
> For themselves and to ensure inter-operability.
>
> > Perhaps it is time to revisit that and consider writing some sort of
> > BCP document - but that is a separate discussion.
>
> If the people who want this knowledge shared can find someone who
> has that knowledge, and who wants to write it down, they should go
> ahead. :)
>
> The thing that surprises me a little is that a lot of these protocol
> enhancements (or even brand new protocols) would not be necessary
> if all implementations were perfect. It seems that in some cases we
> try to improve networks not by improving equipment, but by trying
> just a slightly different way to do things.
>
> Again, if people think that LSVR is a good idea, then how can they
> think that ISIS flooding over TCP is not a good idea ? This is
> the base idea for our proposal. A quick look at the LSVR draft
> shows people from Cisco, Nokia (and Arrcus). (I'm not sure what
> Juniper or Arista or other vendors think about using BGP-LS).
>
> > [Les:] I do not advocate "just send a CSNP and be done with it" for
> > the very reasons you mention.
> > You will see that
> > https://tools.ietf.org/html/draft-li-lsr-dynamic-flooding-01#section-6.7.2
> > specifies that normal LSPDB synchronization occurs on adjacency
> > bringup.
>
> That text mentioned enabling (and thus also disabling) flooding over
> specific adjacencies. When you (re-)enable flooding over an adjacency, you
> need to do the synchronization over again. So you need to exchange CSNPs and
> transmit any missing LSPs. That's what I meant by writing "and be done
> with it".
> It's not a complex idea. But when CSNPs get dropped, synchronization becomes
> more expensive (because you transmit LSPs which might not have been
> necessary).
> I hope using TCP will make synchronization in this case more efficient.
>
> henk.