Re: [rrg] IRON-RANGER scalability and support for packets from non-upgraded networks

Templin, Fred L Mon, 15 Mar 2010 11:24:03 -0700

Robin,

See below for some follow-up:


> -----Original Message-----
> From: Robin Whittle [mailto:r...@firstpr.com.au]
> Sent: Friday, March 12, 2010 7:11 PM
> To: RRG
> Cc: Templin, Fred L
> Subject: Re: [rrg] IRON-RANGER scalability and support for packets from
> non-upgraded networks
>
> Short version:    Exploring the scalability of IRON-RANGER's
>                   "bubble"-based registration system - every
>                   10 minutes the two IRON routers of the two
>                   ISPs send a registration packet to however
>                   many VP (Virtual Prefix) Iron routers there
>                   are for the VP which covers the I-R PI
>                   prefix in question.
>
>                   I think the scaling properties of this
>                   system look bad - and I can't yet see how
>                   the IRON routers can discover the IP addresses
>                   of all the VP routers.
>
>
> Hi Fred,
>
> You wrote:
>
> >>> IRON-RANGER used to speak of using IPv6 neighbour discovery
> >>> as the means for locator liveness testing, dissemination
> >>> of routing information, secure redirection, etc. However,
> >>> the VET and SEAL mechanisms are being revised to instead
> >>> use a different mechanism called the SEAL Control Message
> >>> Protocol (SCMP) for tunnel endpoint negotiations that occur
> >>> *within* the tunnel sublayer and are therefore not visible
> >>> to either the outer IP protocol nor the inner network layer
> >>> protocol. Hence, the inner network layer protocol could be
> >>> anything, including IPv4, IPv6, OSI CLNP, or any other network
> >>> layer protocol that is eligible for encapsulation in IP.
> >>
> >> OK.  I hope you will be able to explain these things not just in
> >> terms of high-level concepts, but to give examples of how the whole
> >> thing would actually work on a large scale.
> >
> > OK if you are talking about an architectural description,
> > but please note that both VET and SEAL are already full
> > functional specifications that can be used by software
> > developers to produce real code.
>
> I think I-R needs to be described in a way that someone who is up to
> speed on scalable routing in general can read one or perhaps two I-R
> documents and have a good idea of how the whole thing is going to
> work - including with respect to scaling and security.  This doesn't
> require exact bits in headers, but that could be part of it.  I think
> it needs to be pretty much self-contained rather than requiring
> people to read other documents which are not part of I-R.

There is room in a future update to IRON to improve on this.

> >> For instance, how many IRON routers are there in an IPv4 I-R system,
> >> and how many individual EID prefixes?
> >
> > Let's suppose that each VP is an IPv6 ::/32, and that
> > the smallest unit of PI prefix delegation from a VP is
> > an IPv6 ::/56. In that case, there can theoretically be
> > up to 4B VPs in the IRON RIB and 16M PI prefixes per VP.
> > In practice, however, we can expect to see far fewer than
> > that until the IPv6 address space reaches exhaustion
> > which many believe will be well beyond our lifetimes.
>
> OK.  Still, depending on how the address space was allocated - or at
> least that subset of the address space covered by I-R's VPs - there
> could be high numbers, approaching 16M perhaps, of I-R PI prefixes
> per VP.

Well, this is a tunable knob of course. We could for
example set the length for VPs to ::/36, ::/40, etc.
to reduce the number of PI prefixes per VP.

The tradeoff is in managing a RIB containing a large
number of VPs (which are likely to be quite stable) vs.
managing a large number of PI prefixes per VP (which
require periodic keepalives to maintain). So, given a
routing protocol that can maintain a large number of
VPs in a relatively static topology it seems like a
proper balance of PI prefixes per VP can be found.
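
A minimal sketch of this tradeoff, assuming every VP has the
same length and every PI prefix is a ::/56 as in the example
above. The function and constant names are illustrative only
and do not come from the VET, SEAL or IRON documents.

```python
from typing import Tuple

# Assumption from this thread: the smallest PI delegation is a ::/56.
PI_PREFIX_LEN = 56


def vp_tradeoff(vp_len: int, pi_len: int = PI_PREFIX_LEN) -> Tuple[int, int]:
    """Return (theoretical max VPs in the RIB, max PI prefixes per VP)."""
    if not 0 <= vp_len <= pi_len <= 128:
        raise ValueError("need 0 <= vp_len <= pi_len <= 128")
    max_vps = 2 ** vp_len                 # one RIB entry per VP
    pi_per_vp = 2 ** (pi_len - vp_len)    # keepalive state per VP
    return max_vps, pi_per_vp


if __name__ == "__main__":
    # Longer VPs mean more (stable) RIB entries but fewer PI
    # prefixes to refresh per VP.
    for vp_len in (32, 36, 40):
        max_vps, pi_per_vp = vp_tradeoff(vp_len)
        print(f"::/{vp_len}: up to {max_vps:,} VPs, "
              f"{pi_per_vp:,} PI prefixes per VP")
```

Each 4 bits added to the VP length cuts the PI prefixes per VP
by a factor of 16 and multiplies the possible VP count by 16.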

> > Still thinking (very) big, let's try sizing the system
> > for 100K VPs; each with 100K ::/56 delegated PI prefixes.
> > That would give 10B ::/56 PI prefixes, or 1 PI prefix
> > for every person on earth (depending on when you sample
> > the earth's population). Let's look at the scaling
> > considerations under these parameters:
>
> OK, I think this is a good scenario to discuss.  I assume that the
> VPs can be of various sizes, so some VPs could be a longer prefix,
> covering less space, if there are a larger number of I-R PI prefixes
> within that part of the address space.

The length of the VPs is a tunable. It may be that there
can be VPs of varying lengths, but I chose to treat all
VPs as having the same length for simplicity.

> As far as I know, you don't need VPs covering the entire advertised
> subset of global unicast address space.  However, for worst-case
> scaling discussions I think it is good to assume this.
>
>
> >> Then, how do these IRON
> >> routers, for each of these EID prefixes continually and repeatedly (I
> >> guess every 10 minutes or less) securely inform a given number of VP
> >> routers they are the router, or one of the routers, to which packets
> >> matching a given EID prefix should be tunneled.  Since there could be
> >> multiple VP routers for a given VP, and the IRON routers don't and (I
> >> think) can't know where they are, how does this process work securely
> >> and scalably?
> >
> > Each IRON router R(i) discovers the full map of VPs in
> > the IRON through participation in the IRON BGP.
>
> I recall that some IRON routers handle VPs and others don't.  As I

Not quite. All IRON routers by definition connect to the
IRON. So, all IRON routers discover all VPs in the IRON,
and *some* IRON routers also connect to the DFZ. Those
that connect to the DFZ advertise one or a few very short
prefixes (e.g., 4000::/3) that cover the set of all VPs
in the IRON.
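
As a rough illustration of what "cover" means here, the sketch
below checks whether some VPs fall inside 4000::/3. The VP
values are hypothetical, picked only for the example; only
4000::/3 comes from the text above.

```python
import ipaddress

# Example short prefix from this thread, advertised into the DFZ.
COVERING = ipaddress.ip_network("4000::/3")

# Hypothetical VPs, for illustration only.
EXAMPLE_VPS = ["4001:db8::/32", "5f00:1234::/32", "2001:db8::/32"]


def covered_by(vp: str, covering=COVERING) -> bool:
    """True if the VP lies entirely inside the covering prefix."""
    return ipaddress.ip_network(vp).subnet_of(covering)


if __name__ == "__main__":
    for vp in EXAMPLE_VPS:
        status = "covered" if covered_by(vp) else "NOT covered"
        print(f"{vp}: {status} by {COVERING}")
```

A VP outside the short prefix (the third one here) would need
its own advertisement, so the scaling gain depends on the VPs
being allocated from within the covering prefix.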

> wrote earlier, assuming VP routers advertise the VP in the DFZ, not
> just in the I-R overlay network, then they are acting like LISP PTRs
> or Ivip DITRs.  In order for them to do this in a manner which
> generally reduces the path length from sending host, via VP router to
> the IRON router which delivers the packet to the destination, I think
> that for each VP something like 20 or more IRON routers need to be
> advertising the same VP.

No; those IRON routers that also connect to the DFZ
advertise very short prefixes into the DFZ; they do
not advertise each individual VP into the DFZ else
there would be no routing scaling suppression gain.

> I interpret your previous sentence to mean that all the IRON routers
> are part of the IRON BGP overlay network, and that each one will
> therefore get a single best path for each VP.  That will give it the
> IP address of one IRON router which handles this VP.  It won't give
> it any information on the full set of IRON routers which handle this VP.

Here, it could be that my cursory understanding of BGP
is not matching well with reality. Let's say IRON routers
A and B both advertise VP1. Then, for any IRON router C,
C needs to learn that VP1 is reachable through both A and
B. I was hoping this could be done with BGP, but I believe
this could only happen if BGP supported an NBMA link model
and could push next hop information along with advertised
VPs. Do you know whether this arrangement could be realized
using standard BGP?

If we are expecting too much with BGP, then I believe we can
turn to OSPF or some other dynamic routing protocol that
supports an NBMA link model. In discussions with colleagues,
we believe that the example arrangement I cited above can
be achieved with OSPF.

> > That
> > means that each R(i) would need to perform full database
> > synchronization for 100K stable IRON RIB entries that rarely
> > if ever change.
>
> I am not sure what you mean by "full database synchronization".  Only
> a subset of IRON routers advertise a VP, and each IRON router would
> get a best-path to a single IRON router out of potentially numerous
> IRON routers which were advertising a given VP.  So any one IRON
> router would not be able to use the IRON BGP overlay system to either
> discover the IP addresses (or best paths) to all IRON routers, or to
> all the IRON routers which advertise VPs, assuming that some VPs were
> advertised by more than one IRON router.

What we need here is a dynamic routing protocol that
supports an NBMA link model, and the IRON is treated
as a gigantic NBMA link on which all IRON routers are
attached. Maybe BGP won't fill the bill for that, but
other dynamic routing protocols such as OSPF show some
promise.

> > This doesn't sound terrible even for existing
> > core router equipment. As you noted, it is also possible that
> > a given VP(j) would be advertised by multiple R(i)s - let's
> > say each VP(j) is advertised by 2 R(i)s (call them R(x) and
> > R(y)). But, since the IRON RIB is fully populated to all
> > R(i)s, each R(i) would discover both R(x) and R(y) that
> > advertise VP(j).
>
> I don't see how this would occur.  A given IRON router receives best
> paths for each VP, so for VP(j) it will get a best path to (and IP
> address of) either R(x) or R(y).

As above.

> > Now, for IRON router R(i) that is the provider for 100K PI
> > prefixes delegated from VP(j), R(i) needs to send a "bubble"
> > to both R(x) and R(y) for each PI prefix.
>
> It's no doubt a relief to less muscle-bound scalable routing
> architectures that the routers of IRON-RANGER are hurling about
> merely "bubbles" rather than something with greater impact!

No worries; they are harmless, and not at all weapons
of war.

> > That would amount to 200K bubbles every 600 sec, or 333
> > bubbles/sec.  If each bubble is 100bytes, the total bandwidth
> > required for updating all of the 100K PI prefixes is 260Kbps.
>
> I am not sure each registration "bubble" would only be 100 bytes of
> protocol-level data.  You need to specify, for IPv6:
>
>   1 - The IP address of the IRON sending the registration (16 bytes).

You mean in the data portion of the bubble or in the header?
For IPv6-over-IPv4, the bubble does not need to include an IPv6
header; it need only include the IPv4 header, since VET stateless
address mapping allows the IPv6 link-local address to be discovered
by knowing only the IPv4 address. I can't see why an IPv6 address
would also be required in the data portion of the bubble if it can
already be inferred from the IPv4 header?

>   2 - The prefix the IRON router is registering (18 bytes).

Not necessarily 18 bytes; prefix plus length is all that
is needed. For a ::/32, that would be 4 bytes of prefix
plus 1 length byte = 5 bytes. Since IPv6 likes to do
things in blocks of 8, however, let's round up to 16
to be safe.
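
A small sketch of that arithmetic, assuming the prefix is
carried as its significant bytes plus one length byte. This
layout is an assumption for the estimate; it is not taken from
the VET or SEAL specifications.

```python
import math


def encoded_prefix_bytes(prefix_len: int, align: int = 1) -> int:
    """Bytes for ceil(prefix_len / 8) prefix bytes plus 1 length byte,
    rounded up to a multiple of `align`."""
    if not 0 <= prefix_len <= 128:
        raise ValueError("prefix_len must be between 0 and 128")
    if align < 1:
        raise ValueError("align must be at least 1")
    raw = math.ceil(prefix_len / 8) + 1
    return math.ceil(raw / align) * align


if __name__ == "__main__":
    for plen in (32, 56):
        print(f"::/{plen}: {encoded_prefix_bytes(plen)} bytes raw, "
              f"{encoded_prefix_bytes(plen, align=8)} bytes in 8-byte blocks")
```

Under this layout a ::/56 PI prefix also fits in one 8-byte
block, so the 16 bytes used in the estimate leaves headroom.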

>   3 - Nonces and other stuff which invariably accompany messages
>       such as this (10 to 20 bytes?).

The SEAL header with a sequence number that also
serves as a nonce is used for this - the SEAL
header plus sequence number length is accounted
for below:

>   4 - Authentication material, such as a digital signature for the
>       above, including the public key of the signer (the
>       IRON router itself?) and a pointer to one or more PKI CAs or
>       whatever so the VP router can ascertain that this really is
>       the public key of the signer.  These will be FQDNs - lets
>       say 50 bytes or so.

I honestly do not know how much this would be. I will
take your 50 byte estimation.

> Maybe you could get the whole thing into 100 bytes.  Then add the
> IPv6 header - 40 bytes - and a UDP header 8 bytes - and we are up to
> about 150 bytes already.

No IPv6 header; only an IPv4 header (20 bytes) plus a SEAL
header (8 bytes) plus possibly also a UDP header (8 bytes)
for a total of 36.

> Add in L2 headers - Ethernet is 46 octets -

I guess you are counting everything from the preamble to the
end of the interframe gap? I come up with 42 (when 802.1Q header
is added), but I'll use your 46 to be conservative.

>  and we are up to 200 bytes.  Multiply by 8 and this is 1600 bits.

I have (36 + 16 + 50 + 46) = 148. So, call it 150 to be
safe, and the guesstimate is midway between your 200 and
the 100 I said initially.

>   1600 x 333 = 532,800 bits/sec ~=0.5Mbps

I get 1200 * 333 = 399,600 bps ~=0.4Mbps
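
The two estimates can be reproduced with a short sketch. All
byte counts are the ones quoted in this exchange; the names
are illustrative.

```python
# Per-bubble byte counts as estimated in this thread.
BUBBLE_FIELDS = {
    "ipv4_header": 20,
    "seal_header": 8,
    "udp_header": 8,      # possibly present
    "prefix_field": 16,   # prefix plus length, rounded up
    "auth_material": 50,  # Robin's estimate, taken as-is
    "l2_overhead": 46,    # Robin's Ethernet figure
}


def bubble_size_bytes(fields=BUBBLE_FIELDS) -> int:
    """Sum of all per-bubble byte counts."""
    return sum(fields.values())


def bandwidth_bps(bubbles_per_interval: int, bubble_bytes: int,
                  interval_s: int) -> float:
    """Average bits per second for a steady stream of bubbles."""
    return bubbles_per_interval * bubble_bytes * 8 / interval_s


if __name__ == "__main__":
    print("summed bubble size:", bubble_size_bytes(), "bytes")
    # 200K bubbles every 600 seconds, at 150 and at 200 bytes each.
    for size in (150, 200):
        bps = bandwidth_bps(200_000, size, 600)
        print(f"{size}-byte bubbles: {bps / 1e6:.2f} Mbps")
```

The figures in the text use a rate truncated to 333 bubbles/sec,
so they come out slightly below what this computes from the
untruncated rate.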

> This is the bandwidth of incoming packets to R(x) and likewise for
> R(y) in your description.   This is assuming two IRON routers
> ("200k bubbles every 600 sec") per I-R PI prefix.
>
> But your description varies from mine already in two other important
> respects.
>
> Firstly, if these VP-advertising routers are to operate properly like
> DITRs or PTRs, there need to be a lot more than 2 of them per VP.

No, because all that needs to be injected into the DFZ is
one or a few very short prefixes (e.g., 4000::/3). It doesn't
matter then which IRON router is chosen as the egress to get
off of the DFZ, since that router will also have visibility
to all VPs on the IRON.

> Let's say 20.  Maybe 10 would be acceptable, maybe more - but 20 will
> do.  Let's call them RVP(j, 0) to RVP(j, 19) where, in your example:
>
>   R(x) == RVP(j, 0)
>   R(y) == RVP(j, 1)
>
> Secondly, I don't see how R(i) could discover the IP addresses of
> more than one of this set of 20 routers.

As above, it is only 2-3 IRON routers per VP; not 20.

> In my model, if it could be shown how routers such as R(i) which
> handle the 100k I-R PI prefixes in VP(j) could discover all the 20
> routers RVP(j, 0) to RVP(j, 19), then each of these 20 routers has
> this incoming bandwidth.
>
> > Now, let's say that each PI prefix is multihomed to 2 providers,
> > then we get 2x the message traffic for 520Kbps total for the
> > bubbles needed to keep the 100K PI prefixes refreshed.
>
> You already assumed two IRON routers per I-R PI prefix in your
> 260kbps figure above, so there's no need to double it again to 520kbps.
>
> 2 ISPs seems a reasonable figure, which was already part of my
> calculations.
>
> Each provider has an IRON router which handles a given I-R PI prefix,
> and each such IRON router is sending bubbles to all the VP routers
> (though I don't yet understand how these VP routers would be
> discovered - and I am assuming there are 20 of them while you are
> assuming there will be 2 of them).
>
> My figure is 532kbps ~= 0.5Mbps incoming bandwidth per VP router.
>
>
> >> If the VP routers act like DITRs or PTRs by advertising their VP in
> >> the DFZ, then in order to make them work well in this respect - to
> >> generally minimise the extra path length taken to and from them
> >> compared to the path from the sending host to the proper IRON router
> >> - I think you need at least a dozen of them.   This directly drives
> >> the scaling problems in the process just mentioned where the IRON
> >> routers continually register each of their EID prefixes with the
> >> dozen or so VP routers which cover that EID prefix.
> >
> > I don't understand why the dozen - I think with IRON VP
> > routers, the only reason for multiples is for fault tolerance
> > and not for optimal path routing, since path optimization will
> > be coordinated by secure redirection. So, just a couple (or a
> > few) IRON routers per VP should be enough I think?
>
> Secure redirection works when an IRON router sends the initial packet
> to a VP router, but it doesn't apply when the sending router is that
> of a non-upgraded network.  To support generally low stretch paths
> from those sending networks to the IRON router which is currently the
> desired one for forwarding packets to the destination network, I
> think you need a larger number.  20 is a rough figure, assuming a
> global distribution of sending hosts and IRON routers which handle
> the I-R PI prefixes - as is required for real portability.

Again, DFZ routers on the non-upgraded network would select
the closest IRON router that advertises, e.g., 4000::/3 as
the router that can get off the DFZ and onto the IRON. So,
it would not be the case that all VPs would be injected into
the DFZ.

> If all the IRON routers for the I-R PI prefixes of a given VP were in
> Europe, then it would suffice to have all the VP routers also in
> Europe - so depending on the need for robustness and load sharing,
> perhaps you wouldn't need 20 of them.  Maybe 5 would do.  But
> generally, for this kind of scaling discussion, I think we need to
> assume the goal of global portability of the new kind of address
> space, with sending hosts likewise distributed globally.
>
> So I think that for a VP containing 100k I-R PI prefixes, there are
> going to be 20 such VP routers, and each is going to get a continual
> 1Mbps stream of registration packets.

Not 20; only 2 or 3. And, it would be less than 1Mbps per
VP router.

> This is not counting the work that VP router needs to do in order to
> establish the authenticity of those registrations.  As far as I know,
> it could only do this by looking up PKI CAs (Certification
> Authorities) on a regular basis to ensure the signed registrations
> were valid.
>
> There are serious scaling problems per VP router in handling 333
> signed registrations per second. That's a lot of crypto stuff to do
> just to check the signatures - and a lot more work and packets going
> back and forth for regularly checking that the public keys provided
> are still valid.

Crypto overhead can be greatly relaxed if the IRON router
performs crypto only for the initial prefix registration
then accepts bubbles without performing the crypto for
subsequent prefix refreshments. This is because, using
SEAL, there are synchronized sequence numbers for blocking
off-path injections of bogus bubbles.
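
A rough sketch of that idea, under stated assumptions: the
32-bit sequence space, the small acceptance window and the
class and method names below are all hypothetical, and the
signature check is a caller-supplied stand-in. SEAL's actual
sequence number handling is defined in the SEAL specification
and is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

SEQ_MOD = 2 ** 32  # assumed sequence number space
WINDOW = 16        # assumed tolerance for lost bubbles


@dataclass
class Registration:
    next_seq: int  # next sequence number expected from this router


class VpRegistrar:
    """Toy registrar: crypto on first registration, sequence check after."""

    def __init__(self, verify_signature: Callable[[str, str, bytes], bool]):
        self._verify = verify_signature
        self._table: Dict[Tuple[str, str], Registration] = {}

    def register(self, router: str, prefix: str, seq: int,
                 signature: bytes) -> bool:
        """Initial registration: the (expensive) signature check runs here."""
        if not self._verify(router, prefix, signature):
            return False
        self._table[(router, prefix)] = Registration((seq + 1) % SEQ_MOD)
        return True

    def refresh(self, router: str, prefix: str, seq: int) -> bool:
        """Later bubble: no crypto, accepted only if seq is in the window."""
        reg = self._table.get((router, prefix))
        if reg is None:
            return False  # never registered, so a refresh is meaningless
        ahead = (seq - reg.next_seq) % SEQ_MOD
        if ahead >= WINDOW:
            return False  # replayed or guessed sequence number
        reg.next_seq = (seq + 1) % SEQ_MOD
        return True
```

An off-path sender that cannot observe the current sequence
number has to guess a value inside the window, which is what
makes skipping the signature check on refreshes plausible.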

> There is also the scaling problem of there being 20 or so of these VP
> routers, so the entire Internet needs to handle 20 x 0.5Mbps = 10Mbps
> continually just to handle the registration of these 100k I-R PI
> prefixes.  Each such prefix requires 100 bits per second in continual
> registration activity - 5 bits per second per VP router per I-R PI
> prefix.  For each VP router, 5 bits per second on average comes from
> each of the typically two IRON routers which are registering a given
> I-R PI prefix.
>
> Checking this: If there was a single VP router and a single IRON
> router registering an I-R PI prefix, the IRON router would send 1600
> bits every 600 seconds. This is 2.66 bits a second.  Since there are
> 20 VP routers, the figure per IRON router per I-R PI prefix is 53bps.
> Since there are two such IRON routers per I-R PI prefix, together
> they send 106bps per I-R PI prefix.  With 100k of these I-R
> PI prefixes per VP, this is about 10Mbps.  This checks out OK.

You are off by a factor of 10 here, because there only needs
to be 2 VP routers per VP.

> I think this is an unacceptable continual burden of registration traffic.
>
> Also, this is just for 10 minute registrations.  I recall that the 10
> minute time is directly related to the worst-case (10 minute) and
> average (5 minute) multihoming service restoration time, as per our
> previous discussions.  I think that these are rather long times.

Well, let's touch on this a moment. The real mechanism
used for multihoming service restoration is Neighbor
Unreachability Detection. Neighbor Unreachability
Detection uses "hints of forward progress" to tell if
a neighbor has gone unreachable, and uses a default
staletime of 30sec after which a reachability probe
must be sent. This staletime can be cranked down even
further if there needs to be a more timely response to
path failure. This means that the PI prefix-refreshing
"bubbles" can be spaced out much longer - perhaps 1 every
10hrs instead of 10min. (Maybe even 1 every 10 days!)

In this way, the PI prefix registration process begins
to very much resemble DHCP prefix delegation.
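
To see how much the interval matters, here is a sketch that
recomputes the registration load for different bubble
intervals and VP router counts. The 150-byte bubble, 100K PI
prefixes per VP and 2 IRON routers per PI prefix are the
figures used earlier in this thread; the function name is
illustrative.

```python
from typing import Tuple


def registration_load(interval_s: float, vp_routers: int,
                      pi_prefixes: int = 100_000,
                      iron_routers_per_prefix: int = 2,
                      bubble_bytes: int = 150) -> Tuple[float, float]:
    """Return (bps arriving at one VP router, bps summed over all
    VP routers of the same VP)."""
    bits_per_interval = (pi_prefixes * iron_routers_per_prefix
                         * bubble_bytes * 8)
    per_router = bits_per_interval / interval_s
    return per_router, per_router * vp_routers


if __name__ == "__main__":
    intervals = {"10 min": 600, "10 hrs": 36_000, "10 days": 864_000}
    for label, seconds in intervals.items():
        for vp_routers in (2, 20):
            per_router, total = registration_load(seconds, vp_routers)
            print(f"{label:>7}, {vp_routers:2d} VP routers: "
                  f"{per_router:,.0f} bps each, {total:,.0f} bps total")
```

Moving from 10 minutes to 10 hours divides every figure by 60,
whichever VP router count is assumed.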

> >> Your IDs tend to be very high level and tend to specify external RFCs
> >> for how you do important functions in I-R.
> >
> > You may be speaking of IRON/RANGER, but the same is not
> > true of VET/SEAL. VET and SEAL are fully functional
> > specifications from which real code can be and has been
> > derived.
>
> Yes - SEAL is a self-contained protocol, but I still found it hard to
> navigate my way within the one document.

The IRON document has a lot of room to add more
descriptive text on the architecture. But, the
mechanisms are already specified in VET and SEAL.

> >> Yet those RFCs say
> >> nothing about I-R itself.  I think your I-Ds generally need more
> >> material telling the reader specifically how you use these processes
> >> in I-R.   Then, for each such process, have a detailed discussion
> >> with real worst-case numbers to show that it is scalable at every
> >> level for some worst-case numbers of EID prefixes, IRON routers etc.
> >> - as well as secure against various kinds of attack.
> >
> > Does the analysis I gave above help? If so, I can put
> > it in the next version of IRON.
>
> This is the sort of example I am hoping you will add.  But first I
> think there are two questions I raised which would need to be
> resolved before your example would be realistic according to my
> understanding of I-R:
>
>   1 - How does an IRON router discover all the IRON routers
>       advertising a VP?  The I-R BGP overlay network does not
>       provide this, as far as I know.

We believe that OSPF with NBMA link model (or equivalent)
could be used.

>   2 - Allow for 20 or so routers each advertising the one VP,
>       for the purposes of supporting packets from non-upgraded
>       networks.

We don't need 20; we only need 2-3. And, the bubble
interval (aka the "lease lifetime") can probably be
pushed out by a factor of ~100.

> Assuming 2 is accepted, and 1 is somehow achieved, we now have, for
> each of the 20 VP routers, 0.5Mbps of registration traffic.  That's a
> lot of traffic and a lot of crypto processing to do.

Crypto is not needed on each and every bubble;
only on the first bubble.

> It is no doubt more efficient than the ~100k or so extremely
> expensive BGP routers of today's DFZ fussing around comparing notes
> about 300k prefixes.  However, I don't think it scales as well as an
> alternative:
>
>   http://tools.ietf.org/html/draft-whittle-ivip-arch
>   http://tools.ietf.org/html/draft-whittle-ivip-drtm
>
> which doesn't have such continual flows of registration, mapping etc.
> data, unrelated to the traffic flowing to a given micronet, or to
> changes in the ETR to which the micronet is mapped.

I think we have learned a few things about the scaling,
and there are solutions. Consider now the bubble interval
as being analogous to the DHCP lease lifetime, and scaling
can be greatly improved for (much) longer bubble intervals.

> >>>>   8 - Apart from Ivip's Modified Header Forwarding arrangements,
> >>>>       CES architectures involve encapsulation for tunneling
> >>>>       packets from ITRs to ETRs (IRON-RANGER doesn't have ITRs and
> >>>>       ETRs, but it still requires encapsulated tunneling).  There
> >>>>       are some problems with this - but they do not appear to be
> >>>>       prohibitive.
> >>> IRON-RANGER calls them as ITEs/ETEs because it is possible
> >>> to also configure a tunnel endpoint on a host and not just
> >>> on routers. In terms of routers, the IRON-RANGER ITE/ETE
> >>> are exactly equivalent to what the other proposals are
> >>> calling as ITR/ETR.
> >> OK.  In Ivip the sending host can have an "ITR" function - though it
> >> is not a router and this "ITR" function doesn't advertise routes to
> >> the MABs (Mapped Address Blocks) inside the host.  It does however
> >> only handle packets sent by the host's stack which have destination
> >> addresses matching any of the MABs.  I am sticking with "ITR" and
> >> "ETR" in Ivip, to remain compatible with LISP - and because I think
> >> they are easier to pronounce than "ITE" and "ETE".
> >
> > I'm not sure about this - an {Ingress/Egress} Tunnel
> > *Router* is a router that happens to terminate tunnel
> > endpoints. On the other hand, an {Ingress/Egress}
> > Tunnel *Endpoint* is tautologically a tunnel
> > *endpoint* - so, why not call it as such?
>
> I am not suggesting you adopt "ITR" and "ETR" instead of "ITE" and
> "ETE" - which I agree are more apt terms.  I was just explaining why,
> for now, I will stick with "ITR" and "ETR" for Ivip.

OK - Fred
fred.l.temp...@boeing.com

>   - Robin

_______________________________________________
rrg mailing list
rrg@irtf.org
http://www.irtf.org/mailman/listinfo/rrg
