Re: [rrg] IRON-RANGER scalability and support for packets from non-upgraded networks

Templin, Fred L Wed, 17 Mar 2010 14:56:57 -0700

Hi Robin,

Responding again:


> -----Original Message-----
> From: Robin Whittle [mailto:r...@firstpr.com.au]
> Sent: Tuesday, March 16, 2010 5:30 PM
> To: RRG
> Cc: Templin, Fred L
> Subject: Re: [rrg] IRON-RANGER scalability and support for packets from 
> non-upgraded networks
>
> Short version:   Further discussion on "DITR-like" routers, now
>                  called "IRON Default Mappers" (IDMs), how an IRON
>                  router registers an end-user network EID prefix
>                  with multiple VP routers, and whether to use OSPF
>                  rather than BGP for the IRON overlay network.
>
>
> Hi Fred,
>
> You wrote:
>
> >> OK - so these are the I-R equivalents of Ivip's DITRs (Default ITRs
> >> in the DFZ) and LISP PTRs.  In my previous message, I assumed that VP
> >> routers were also advertising their VPs in the DFZ.  I recall I got
> >> this from something you wrote, but it doesn't matter now.
> >>
> >> But what are the scaling properties of these routers I will refer to
> >> as being "DITR-like"?
> >>
> >> Who runs them?  They are doing work, handling packets addressed to
> >> very large numbers of I-R end-user network prefixes, who are the
> >> parties which benefit.  So I think there needs to be an arrangement
> >> for money to flow from those end-user networks, in rough proportion
> >> to the traffic each DITR-like router handles for each end-user
> >> network.  This is handled in Ivip, but with DITRs which advertise
> >> specific subsets of the MABs (Mapped Address Blocks):
> >>
> >>   http://tools.ietf.org/html/draft-whittle-ivip-arch-04#section-8.1.2
> >>
> >> I suggest you devise a business case for these "DITR-like" routers -
> >> and give them a name.
> >
> > The more I think about it, the more these specialized
> > VP routers are really just Default Mappers, i.e.,
> > similar to those discussed in APT. On the IRON, they
> > advertise "default", and on the DFZ they advertise one
> > or a few short prefixes (e.g., 4000::/3) that cover all
> > of the VPs in use on the IRON.
>
> This is different from what I understood from your previous message.
>
> I understood there was a subset of IRON routers which we call "VP
> routers".  Each such "VP router" advertises in the IRON network (the
> tunnel-based overlay network between all IRON routers, currently
> implemented with BGP) a VP (Virtual Prefix).  There are typically two
> or perhaps more such routers advertising a given VP.  Each such VP
> router may also advertise other VPs, but for this discussion, let's
> think of IRON routers A, B and C all advertising VP "P" on the IRON
> network - and for the purposes of discussion not advertising any
> other VPs.

OK.

> After your msg06274, when I wrote msg06278, I understood that the VP
> routers also advertised their VPs in the DFZ, and that this was the
> mechanism by which I-R supported packets sent by hosts in
> non-upgraded networks.  It doesn't matter now why I thought this.
>
> From the most recent pair of messages, your msg06305 and my msg06315,
> I thought that this role, which I described as "DITR-like" was
> performed by a subset of IRON routers which advertise one or a few
> prefixes in the DFZ which cover the entire I-R "edge" subset of the
> global unicast address space.  This was on the basis of your:
>
>    >> Firstly, if these VP-advertising routers are to operate
>    >> properly like DITRs or PTRs, there needs to be a lot more than
>    >> 2 of them per VP.
>    >
>    > No, because all that needs to be injected into the DFZ is
>    > one or a few very short prefixes (e.g., 4000::/3). It doesn't
>    > matter then which IRON router is chosen as the egress to get
>    > off of the DFZ, since that router will also have visibility
>    > to all VPs on the IRON.
>
>    > Again, DFZ routers on the non-upgraded network would select
>    > the closest IRON router that advertises, e.g., 4000::/3 as
>    > the router that can get off the DFZ and onto the IRON. So,
>    > it would not be the case that all VPs would be injected into
>    > the DFZ.
>
> I assumed that these "DITR-like" routers were not necessarily VP routers.

Correct; these routers (IDMs) may also be VP routers on
the IRON but need not be. So, we have three classes of
IRON routers: 1) VP routers, 2) IDMs, and 3) both.

> Here is my understanding on what you just wrote:
>
> > The more I think about it, the more these specialized
> > VP routers
>
> I think you mean the "DITR-like" routers are VP routers. Later you
> refer to these as "IRON Default Mappers (IDMs)".  I had assumed they
> either were not VP routers, or that they need not be VP routers.

The latter - IDMs need not also be VP routers, but they
could be.

> > are really just Default Mappers, i.e.,
> > similar to those discussed in APT. On the IRON, they
> > advertise "default", and on the DFZ they advertise one
> > or a few short prefixes (e.g., 4000::/3) that cover all
> > of the VPs in use on the IRON.
>
> APT's DMs certainly advertised into their routing systems of the
> networks they were located within.  I recall they advertised a set of
> prefixes covering all the "edge" end-user (EID) prefixes of any
> end-user network which was using an ISP in the same APT island.
> There could be multiple APT islands - sets of APT-adopting ISPs which
> were linked by direct BGP links and therefore which were able to
> share all their mapping information, which was carried over those
> direct BGP links.   (If this was extended with tunnels, then all APT
> adopting ISPs would be part of a single global APT island, and this
> would enable EID space to be split more finely than the IPv4 /24
> limit.  With separate islands, any EID prefixes longer than /24 would
> need to use the same island.)
>
> I recall that the DMs also advertised these covering prefixes to
> neighbouring ISPs - AKA "advertising them in the DFZ".  But this
> assumes the DMs were border routers, which I recall was not
> necessarily the case.   So based on what I remember about APT, in a
> single APT island, the subset of DMs which were BRs would act in much
> the same way as LISP's PTRs or Ivip's DITRs, except that with Ivip's
> DITRs each such DITR normally only advertises a subset of the total
> Ivip "edge" space, while these APT DMs would advertise it all.
>
> If there was a single global APT island, then all the DMs which were
> BRs would advertise in the DFZ the complete set of APT "edge" space.
>  I understand from what you just wrote that "these specialised VP
> routers" (IDMs, below) in I-R are also BRs and that each one also
> advertises the complete set of "edge" address space in the I-R system.
>
> However, this part:
>
> > On the IRON, they advertise "default"
>
> makes no sense to me.  I don't recall any IRON router advertising
> "default" on the IRON overlay network.  I understand that a VP router
> advertises its one or more VPs.

Yes; this is new. By having the IDMs connected to the DFZ
advertise "default" on the IRON, other IRON routers that do
not connect to the DFZ can discover a nearby IDM that can
reach the non-upgraded IPv6 Internet.
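
A rough sketch of the forwarding choice this implies on an IRON
router that is not itself connected to the DFZ: longest-prefix match
over the VPs learned from the overlay, falling back to the "default"
advertised by an IDM. The prefixes, router names and the lookup()
helper are made up for illustration and are not taken from any
IRON-RANGER document.

```python
import ipaddress

# Hypothetical overlay table: prefix -> next-hop IRON router (illustrative names)
OVERLAY_ROUTES = {
    ipaddress.ip_network("4000:1::/32"): "vp-router-A",
    ipaddress.ip_network("4000:2::/32"): "vp-router-B",
    ipaddress.ip_network("::/0"): "idm-1",  # "default" advertised by an IDM
}

def lookup(dst):
    """Longest-prefix match over the overlay table."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in OVERLAY_ROUTES if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return OVERLAY_ROUTES[best]
```

A destination inside a VP goes to that VP's router; anything else
(i.e., the non-upgraded IPv6 Internet) goes to the IDM.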

> >> They are going to be busy, depending on where they are located, the
> >> traffic patterns, how many of them there are etc.   So they need to
> >> be able to handle the cached mapping of some potentially large number
> >> of I-R end-user network prefixes.
> >
> > In the case of IPv6, I think whether the IRON Default
> > Mappers (IDMs) will be very busy depends on how large
> > the IPv6 DFZ becomes. In my understanding, the IPv6 DFZ
> > is not very big yet. So, if most IPv6 growth occurs in
> > the IRON and not in the IPv6 DFZ the packet forwarding
> > load on the IDMs might not be so great.
>
> This would only be true if you could convince most networks adopting
> IPv6 to adopt I-R at the same time.

Well, now is the time to put forward the case for
handling new IPv6 growth in the IRON instead of in
the IPv6 DFZ. Otherwise, once growth in the IPv6
DFZ takes off and we start to see significant PI
addressing and multihoming, we will eventually
end up in the same boat we are in with the IPv4
DFZ today.

> >>>> wrote earlier, assuming VP routers advertise the VP in the DFZ, not
> >>>> just in the I-R overlay network, then they are acting like LISP PTRs
> >>>> or Ivip DITRs.  In order for them to do this in a manner which
> >>>> generally reduces the path length from sending host, via VP router to
> >>>> the IRON router which delivers the packet to the destination, I think
> >>>> that for each VP something like 20 or more IRON routers need to be
> >>>> advertising the same VP.
> >>>
> >>> No; those IRON routers that also connect to the DFZ
> >>> advertise very short prefixes into the DFZ; they do
> >>> not advertise each individual VP into the DFZ else
> >>> there would be no routing scaling suppression gain.
> >>
> >> I think there would be, since each VP covers multiple individual
> >> end-user network prefixes.  If there are 10^7 of these prefixes, and
> >> on average each VP covers 100 of them, then there are 10^5 VPs and we
> >> have excellent routing scalability, saving 9.9 million prefixes from
> >> being advertised in the DFZ while providing 10 million prefixes for
> >> end-user networks who use them to achieve portability, multihoming
> >> and inbound TE.
> >
> > That's good, but I think I'd still rather have the
> > IDMs only advertise the highly-aggregated short prefixes.
>
> OK.
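
For reference, the arithmetic in the exchange above works out as
follows. This only restates the figures already quoted (10^7 end-user
prefixes, an assumed average of 100 per VP); the variable names are
illustrative.

```python
end_user_prefixes = 10**7   # figure quoted in the discussion above
prefixes_per_vp = 100       # assumed average, also from the discussion

vps = end_user_prefixes // prefixes_per_vp   # VPs that would be advertised
saved = end_user_prefixes - vps              # prefixes kept out of the DFZ
print(vps, saved)
```

With IDMs advertising only one or a few short covering prefixes, the
DFZ count drops from the number of VPs to that handful.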
>
>
> >>>> I interpret your previous sentence to mean that all the IRON routers
> >>>> are part of the IRON BGP overlay network, and that each one will
> >>>> therefore get a single best path for each VP.  That will give it the
> >>>> IP address of one IRON router which handles this VP.  It won't give
> >>>> it any information on the full set of IRON routers which handle this VP.
> >>>
> >>> Here, it could be that my cursory understanding of BGP
> >>> is not matching well with reality. Let's say IRON routers
> >>> A and B both advertise VP1. Then, for any IRON router C,
> >>> C needs to learn that VP1 is reachable through both A and
> >>> B. I was hoping this could be done with BGP, but I believe
> >>> this could only happen if BGP supported an NBMA link model
> >>> and could push next hop information along with advertised
> >>> VPs. Do you know whether this arrangement could be realized
> >>> using standard BGP?
> >>
> >> Sorry, I can't reliably tell you what can and can't be done with BGP
> >> - I don't try to do anything special with it with Ivip.
> >>
> >> Still, if you assume that something could be done with BGP, consider
> >> the potential scaling problems.  Somehow, for every one of X VPs, and
> >> for each of the Y IRON routers which handle a given VP, you want
> >> each IRON router to learn via BGP the address of every one of these
> >> VP-advertising routers, and which VPs each one advertises.  This is
> >> (X * Y) items of information you are expecting BGP to deliver to
> >> every IRON router - so every BGP router needs to handle this
> >> information.
> >>
> >> The scaling properties of this would depend on how you get BGP to do
> >> it, and how many VPs there are, and how many IRON routers advertise
> >> the same VP.
> >>
> >>
> >>> If we are expecting too much with BGP, then I believe we can
> >>> turn to OSPF or some other dynamic routing protocol that
> >>> supports an NBMA link model. In discussions with colleagues,
> >>> we believe that the example arrangement I cited above can
> >>> be achieved with OSPF.
> >>
> >> OK . . . so you are considering using OSPF on the I-R overlay network
> >> rather than BGP.  I can't discuss that without doing a lot of reading
> >> - which I am not inclined to do.  But see below where I propose
> >> methods of doing the registration within the limits imposed by BGP.
> >
> > I will think about both routing alternatives more. But, if
> > we use OSPF in the IRON overlay, routing would work the same
> > way as at any other layer of RANGER recursion. The list of
> > IDMs could be kept in the DNS under the special domain name
> > "isatapv2.net" which I have set aside for this purpose. All
> > other IRON routers can discover the list of IDMs by simply
> > resolving the name "isatapv2.net".
>
> OK.
>
>
> > The term "bubbles" came from Teredo (RFC 4380). Maybe we can
> > think of a better term to use for IRON-RANGER?
>
> OK.  I don't think "bubbles" is appropriate for the registration
> methods you have described so far, or that I have suggested.

OK. How about Channel Queries (CQs)?

> >> I am definitely not going to try to think about mixed IPv4/v6
> >> implementations of I-R.  I can handle thinking about purely IPv4 and
> >> purely IPv6.
> >
> > I choose to think of mixed IPv4/IPv6 for at least three
> > reasons:
> >
> > 1) We already have global deployment of IPv4, and that won't
> >    go away overnight when IPv6 begins to deploy.
>
> I agree.
>
> > 2) IPv4 is fully built-out, so new growth will come via IPv6.
>
> I don't agree with this at all.  I think there's plenty of scope for
> more growth in the IPv4 Internet.  Fig. 11 at:
>
>   http://www.potaroo.net/tools/ipv4/
>
> shows 130 /8s worth of space is currently advertised.  Fig. 5 shows
> this in more detail.  Of the /8s up to 223, a handful can't be used
> (127, 0 maybe).  There are still a bunch of /8s which are
> unadvertised.  As time progresses, this space will be too valuable to
> use internally, probably inefficiently - so I expect quite a lot of
> that will be made available and advertised too.

OK, but how bad would it be if we just let IPv4 address
depletion run its course under the current system, then ramp
up IPv6 in parallel to handle PI addressing and multihoming?

> Then there are ways of using space more efficiently, as Ivip, LISP
> and probably IRON-RANGER could do, by slicing and dicing it into much
> smaller chunks than is possible with the /24 limit on prefixes in the
> DFZ.

OK.

> I think that most growth in Internet usage will occur in the IPv4
> Internet for at least the rest of this decade.  The only time it
> would make sense to use IPv6 instead of direct IPv4 or IPv4 behind
> NAT would be for some service where it wasn't important to be able to
> connect to IPv4.  At present, you couldn't sell any such service. I
> guess that it may be possible to do this for large IP cell-phone
> deployments where there are enough IPv6 services available to do a
> reasonable subset of what people want in a hand-held device, and
> where tunneling to a server which provides behind-NAT IPv4
> connectivity would also be possible.

I agree that the IPv4 Internet is not only not going away
but also continuing to grow. But, I still think that users
will want to have both IPv4 (behind NAT if necessary) and
IPv6 as we move forward from here.

> > 3) IPv6 addresses can embed IPv4 addresses such that there
> >    is stateless address mapping between an EID nexthop and
> >    an RLOC.
>
> Can you explain this with an example?  I can't clearly envisage what
> you mean.

I mean, if the IPv6 EID FIB includes entries with a next-hop
address such as: 'fe80::5efe:V4ADDR' (i.e., an IPv6 address
with embedded IPv4 address), then V4ADDR can be statelessly
extracted as the RLOC address of the ETR.
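
A minimal sketch of that extraction, using only Python's standard
ipaddress module. The function name is made up, and the check covers
only the 'fe80::5efe:V4ADDR' form given above (other ISATAP
interface identifier variants are not handled).

```python
import ipaddress

ISATAP_MARKER = 0x00005EFE  # the "0000:5efe" half of the interface identifier

def rloc_from_isatap(next_hop):
    """Return the IPv4 RLOC embedded in an address like fe80::5efe:192.0.2.1."""
    value = int(ipaddress.IPv6Address(next_hop))
    # Bits 32..63 (counting from the low end) must carry the 0000:5efe marker.
    if (value >> 32) & 0xFFFFFFFF != ISATAP_MARKER:
        raise ValueError("not an fe80::5efe:V4ADDR style address")
    # The low 32 bits are the embedded IPv4 address.
    return ipaddress.IPv4Address(value & 0xFFFFFFFF)
```

No mapping state is consulted: the RLOC falls straight out of the
next-hop address in the FIB entry.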

> If I am to keep up with mixed IPv4/IPv6 IRON-RANGER, you will need to
> explain things with detailed examples.
>
>
> >> Since you have what to me is a new "DITR-like" router plan for
> >> supporting packets sent from non-upgraded networks, there is no need
> >> for the larger number of VP routers as I assumed in my previous
> >> message.  As long as you have two or three, that should be fine, I think.
> >>
> >> There are two reasons an IRON router M might need to know about which
> >> other IRON routers A, B and C advertise a given VP:
> >>
> >>  1 - When M has a traffic packet.  (M is either an ordinary IRON
> >>      router and advertises the I-R "edge" space in its own network
> >>      or it is a "DITR-like" router advertising this space in the
> >>      DFZ.)  M needs to tunnel the packet to one of these VP routers.
> >>
> >>      The VP router will tunnel it to the IRON router Z it chooses as
> >>      the best one to deliver the packet to the destination network
> >>      and will send a "mapping" packet to M which will cache this
> >>      information and from then on tunnel packets matching the
> >>      end-user network prefix in the "mapping" to Z (or some other
> >>      IRON router like Z, if there were two or more in the "mapping").
> >>
> >>      In this case, M needs only the address of one of the A, B or C
> >>      routers.  Ideally it would have the address of the closest one -
> >>      but it doesn't matter too much if it has the address of a more
> >>      distant one.  That would involve a somewhat longer trip to the
> >>      VP router, and perhaps a longer or shorter trip from there to Z.
> >>      (This would typically be shorter than the path taken through
> >>      LISP-ALT's overlay network.)
> >>
> >>      After M gets the "mapping", it tunnels traffic packets to Z - so
> >>      the distance to the VP router no longer affects the path of
> >>      traffic packets.
> >>
> >>      In this case, BGP on the overlay would be perfectly good - since
> >>      it provides the best path to one of A, B or C - typically that
> >>      of the "closest" (in BGP terms).
> >>
> >>
> >>  2 - When M is one of potentially multiple IRON routers which
> >>      deliver packets to a given end-user network - packets whose
> >>      destination address matches a given end-user network prefix P.
> >>
> >>      M needs to "blow bubbles" (highly technical term from this
> >>      R&D phase of IRON-RANGER) to A, B and C.  The most obvious
> >>      way to do this is for M to be able to know, via the overlay
> >>      network the addresses of all VP routers which advertise a given
> >>      VP.  There may be two or three or a few more of these.  They
> >>      could be anywhere in the world.
> >>
> >>      BGP does not appear to be a suitable mechanism for this, since
> >>      its "best path" basic functions would only provide M with
> >>      the IP address of one of A, B and C.
> >>
> >>      You could do it with BGP, by having A, B and C all know about
> >>      each other, and with all three sending everything they get to
> >>      the others.  This is not too bad in scaling terms for two,
> >>      three or four such VP routers.
> >>
> >>      Then, M sends its registration to one of them - whichever it
> >>      gets the address of via the BGP of the overlay network - and
> >>      A, B and C compare notes so they all get the registration.
> >>
> >>      I will call this the "VP router flooding system".
> >
> > This is a nice idea. If I get what you are suggesting, each
> > IRON router that advertises the same VP (e.g., VP(x)) would
> > need to engage in a routing protocol instance with one
> > another to track all of the PI prefix registrations. The
> > problem I have with it is that that would make for perhaps
> > 10^5 or more of these little routing protocol instances as
> > well as lots and lots of manually-configured peering
> > arrangements between the IRON routers that advertise VP(x).
>
> Something like this - but I am not sure what you mean by "routing
> protocol instance".  I understand that the two or three VP routers
> for any one VP "P" do need to cooperate and share their various
> registrations.  You could either create a fresh protocol to do this,
> or push into service some existing protocol, including perhaps a
> routing protocol.

We haven't brought the Virtual Router Redundancy Protocol (VRRP)
into discussion yet [RFC5798], but we might want to consider
looking at this as a way of providing fault tolerance for VP
routers. I'm not sure whether VRRP would also support load
balancing between the multiple routers, but it seems like
fault tolerance is the dominant consideration.

Using VRRP also reduces the "fanout" of VP-advertising routers
to just a single RLOC address, and so makes for less complexity
in ferrying CQs around the IRON.

> You haven't specified anything other than manual configuration for
> how an IRON router becomes a VP router.  VP routers have extra
> workload, so whoever runs such a router must have a reason to do
> this, probably involving payment of money in some way from the
> end-user networks whose EID prefixes are covered by this VP.

Yes. End-users have to pay either a one-time or
recurring cost for their PI prefixes.

> If there are two or three IRON routers acting as VP routers for a
> given VP, then some organisation is responsible for that VP, is
> collecting payments as described above and is therefore the one
> organisation driving the existence of these two or three VP routers.
>  So manual configuration seems OK to me - I don't think there needs
> to be a fancy automated system by which one VP router for a given VP
> "P" would auto-discover any other VP router for "P" in the whole I-R
> system.  However, these VP routers for the one VP do need to work
> together to share registrations, and to quickly detect when one or
> more of the set becomes unreachable.

VRRP maybe?

> > For these reasons, I believe it is better for IRON router
> > M to know about all three of A, B and C and direct bubbles
> > to each of them. I think we can achieve this using OSPF
> > with the NBMA link model in the IRON overlay.
>
> OK - but I guess that means not running BGP.  I don't know anything
> about OSPF or its scaling properties.  BGP has no central
> coordination - something which is understandably attractive to many
> people.  Does OSPF have central coordination, single points of
> failure etc.?

In this case, central coordination would be through
maintenance of the domainname-to-RLOC mappings for
the FQDN "isatapv2.net". In other words, when a new
IDM comes into existence its RLOC address gets added
to the DNS RR's for "isatapv2.net". In the same way,
when an existing IDM is decommissioned its RLOC address
is removed.
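
A sketch of how an IRON router might pull that list, assuming the
IDM RLOCs are published as ordinary address records under the name.
The discover_idms() helper and its resolver parameter are
hypothetical; socket.getaddrinfo is the standard library call.

```python
import socket

IDM_NAME = "isatapv2.net"  # the name mentioned above

def discover_idms(name=IDM_NAME, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated RLOC addresses published for `name`."""
    # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    infos = resolver(name, None)
    return sorted({info[4][0] for info in infos})
```

Adding or decommissioning an IDM then only requires a DNS update;
routers pick up the change the next time they resolve the name.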

Currently, "isatapv2.net" is registered to me. Do you
trust me to maintain it properly? :^}

> > Please note: the EID-based IRON overlay is configured over
> > the DFZ, which is using BGP to disseminate RLOC-based
> > prefix information. So, it is BGP in the underlay and
> > OSPF in the overlay - weird, but I think it works.
>
> Yes the DFZ uses BGP and the overlay uses . . . originally I-R used
> BGP (a separate instance of BGP in each such router).  Also, IRON
> routers don't need to be DFZ routers and in many or most cases are
> not DFZ (BR) routers - but they all communicate via tunnels which are
> carried between networks via the ordinary Internet (using the DFZ).
>
> I guess these tunnels between IRON routers will need to be manually
> configured, since they are typically between physically and
> topologically nearby routers.

No manual config needed; the IRON is just a gigantic NBMA
link, and can use automatic tunneling the same as for VET
and ISATAP.

> >>>> Also, this is just for 10 minute registrations.  I recall that the 10
> >>>> minute time is directly related to the worst-case (10 minute) and
> >>>> average (5 minute) multihoming service restoration time, as per our
> >>>> previous discussions.  I think that these are rather long times.
> >>>
> >>> Well, let's touch on this a moment. The real mechanism
> >>> used for multihoming service restoration is Neighbor
> >>> Unreachability Detection. Neighbor Unreachability
> >>> Detection uses "hints of forward progress" to tell if
> >>> a neighbor has gone unreachable, and uses a default
> >>> staletime of 30sec after which a reachability probe
> >>> must be sent. This staletime can be cranked down even
> >>> further if there needs to be a more timely response to
> >>> path failure. This means that the PI prefix-refreshing
> >>> "bubbles" can be spaced out much longer - perhaps 1 every
> >>> 10hrs instead of 10min. (Maybe even 1 every 10 days!)
> >>
> >> OK, I am not sure if I ever knew the details of "Neighbor
> >> Unreachability Detection" - but shortening the time for these
> >> mechanisms raises its own scaling problems.
> >>
> >> Can you give some examples of how this would work?
> >
> > I want to go back on this notion of extended inter-bubble
> > intervals, and return to something shorter like 600sec
> > or even 60sec. There needs to be a timely flow of bubbles
> > in case one or a few IRON routers go down and need to
> > have their PI prefix registrations refreshed.
>
> OK - I will stay tuned for further details.

Bringing VRRP into consideration could be a
contributing factor in how long the bubble (er, CQ)
interval needs to be.
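
To get a feel for the numbers, here is a back-of-the-envelope sketch.
It assumes one CQ per PI prefix per VP router (or per VRRP address)
per interval, and reuses the 10^7 prefix figure from earlier in this
thread; the cq_rate() name and the fanout of 3 are illustrative
assumptions, not measurements.

```python
def cq_rate(prefixes, vp_routers_per_vp, interval_s):
    """Aggregate CQs per second across the whole IRON under the stated assumptions."""
    return prefixes * vp_routers_per_vp / interval_s

for fanout in (3, 1):            # three VP routers per VP vs. a single VRRP address
    for interval in (600, 60):   # intervals mentioned above, in seconds
        print(fanout, interval, round(cq_rate(10**7, fanout, interval)))
```

The aggregate rate scales linearly with the fanout and inversely with
the interval, so collapsing A, B and C behind one address cuts the CQ
load by the same factor as tripling the interval would.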

> >> At present, I can see these choices for this registration mechanism:
> >>
> >>   1 - Keep BGP as the overlay protocol and use my proposed "VP router
> >>       flooding system".
> >>
> >>   2 - Retain your current plan of each IRON router like M needing to
> >>       know the addresses of all the routers handling a given VP (A, B
> >>       and C) which BGP can't do.  So you could:
> >>
> >>       2a - keep BGP and add some other mechanism.  Maybe M sends a
> >>            message to the one of A, B or C it has a best path to,
> >>            requesting the full list of all routers A, B and C which
> >>            handle a given VP.  When M gets the list, it sends
> >>            registration "bubbles" to the routers on the list.  This
> >>            needs to be repeated from time-to-time to discover
> >>            new VP routers.
> >>
> >>       2b - use something different from BGP which provides all the
> >>            A, B and C router addresses to every IRON router, such as
> >>            M.  This needs to dynamically change as A, B and C die and
> >>            are restarted, or joined by others.
> >
> > Right - I am still leaning toward OSPF with its NBMA
> > link model capabilities. The good news is that the
> > IRON topology itself should be relatively stable, so
> > not much churn due to dynamic updates.
>
> OK.  Since the IRON routers have their own IP addresses and are
> generally in networks multihomed by existing BGP techniques, then any
> outages don't affect the IRON routers' IP addresses or their
> tunneling arrangements.  There would still be transitory breaks in
> connectivity, before the BGP multihoming arrangements kick in.  If
> you could ignore those by some means in the overlay's routing system
> (BGP or OSPF) then yes, the IRON routers should be pretty stable.

With VRRP, probably even more so.

> >> OK - but you still need to design a registration mechanism before we
> >> can think in detail about scaling.
> >
> > Let's forget about the DHCP lease lifetimes analogy
> > for a bit and get back onto the assumption that the
> > inter-bubble interval is the mechanism that keeps
> > PI registrations refreshed in a timely fashion.
>
> OK.

Thanks - Fred
fred.l.temp...@boeing.com

>   - Robin
_______________________________________________
rrg mailing list
rrg@irtf.org
http://www.irtf.org/mailman/listinfo/rrg
