Was: Re: [rrg] Anycast in the core architecture - sep. OK; elim not?
Short version: We agree: anycast ETRs are of no value.
Bill suggests that in a core edge separation
scheme (Strategy A) in which there are potentially
multiple ETR addresses, an ITR can choose an ETR
in such a way as to replicate the overall
functionality of the current (unscalable)
BGP-based anycast arrangement.
I argue they can't, since they have no way of
knowing the "distance" to the various ETRs.
I suggest the concept of a local "Distance Server"
which ITRs query to find out which ETR address is
"closest". This gets a full feed of DFZ routes
directly or indirectly from a nearby DFZ router.
Its answer is the number of ASes in the most
specific DFZ route which matches the queried ETR
address.
Then there will be a scalable way of doing
host anycast and the "Disaster Recovery"
arrangement.
The same idea can be applied to the host stacks
in core-edge elimination schemes (Strategy B)
- if the mapping has multiple ETR addresses and
instructions to pick the closest one, the host
finds out which is closest from its local Distance
Server.
Ivip, as currently conceived, has only a single
ETR address for each micronet. If this was
changed to have potentially multiple ETR addresses
then this would enable the same thing to occur as
just described for LISP, APT and TRRP. The best
approach would be to add the "Distance Server"
function to the full database QSD Query Server
function. Then, the QSD can return to the ITR
a single ETR address which it chose as the
closest of several. This involves no extra
complexity in the ITR or the ITR<-->QSD query
protocol.
This same idea - integrating a Distance Server
- could also be done for APT's DMs and for
LISP-ALT's currently optional Map Resolvers.
In this light, I will write more in the future
about Bill's example of a "Disaster Recovery"
arrangement. I am working on a message about this
but will revise it in the light of the "Distance
Server" idea.
With BGP, and with existing core-edge separation
techniques, I can find no scalable way of doing
anycast or "Disaster Recovery" arrangements.
However, by adding a "Distance Server" to these
schemes, I think it could be done elegantly and
perfectly scalably.
Below my signature is a copy of what Bill wrote in
msg04901 about "Disaster Recovery" and my attempt
to create a diagram for it.
Hi Bill,
Thanks for your speedy and thoughtful reply.
> I think maybe you're missing the forest for the trees here. The
> functional equivalent of anycast in a strategy-A system is a 1-to-many
> address-to-ETR map where the ETRs in question are entries into
> distinct networks instead of entries into the same network.
> The anycasting functionality happens entirely in the ITR when it
> chooses the ETR.
You could try to do this in LISP, APT or TRRP, where the mapping for
one EID has multiple ETR addresses, but I foresee practical problems
I will discuss below. I can't do this with Ivip as currently
conceived, since the mapping for each micronet is a single ETR address.
Still, I agree that anycast and disaster recovery are uses of the
network we would very much like to do in a more scalable fashion than
is currently possible. With OITRDs / PTRs we can make the core-edge
separation scheme work for packets from all hosts - ideally with the
OITRDs and PTRs widely scattered around the Net and therefore on the
sending-host-to-ETR path, or close to it.
> The ETR's character is that it has only one attachment in the
> topology, not multiple attachments. Giving the ETR multiple
> attachments would defeat the purpose.
>
> As for anycast ETRs, I haven't thought of a single sensible use case.
I agree. I explored anycast ETRs in my previous messages and
couldn't find a use for them which was any better than anycast hosts
with ordinary BGP.
>> Core-edge elimination
>>
>> Could be made to work, but is more complex than the
>> conventional approach and is just as bad in terms of
>> scaling. However, if the Internet was somehow converted
>> to a core-edge elimination approach, it would be impossible
>> to do the conventional BGP approach, in which there was no
>> distinction between identifier and locator (the IP address
>> means both).
>
> Strategy B is like strategy A except we replace layer 4 with a
> protocol suite that handles the ITR/ETR functionality internally,
> eliminating the need for ITRs and ETRs.
I think this is a nifty way of describing it.
A core-edge elimination architecture loads every host with many more
functions in its stack, not least crypto machinery to securely establish
other hosts' true identities (Identifiers). Also, AFAIK, no useful work
can be done (at least with HIP) before there have been several packet
exchanges to establish this identity.
What is the equivalent of a single, ad-hoc, UDP packet in HIP?
Doesn't it involve the two hosts fussing around before they can send
a traffic packet, or trust its contents?
I will write a fuller critique later, but while I think the core-edge
elimination division of labor is conceptually elegant, I think it
places too much state, too much complexity etc. in the host. Also,
especially for slow, expensive, radio links, the management traffic
setting up these host-to-host associations will be economically
expensive and will be slow, especially with higher packet loss rates
on the radio link.
> The anycasting decision
> happens in the originating host, after which the packet travels (or
> fails to travel) to one specific network attachment associated with
> the destination. The origin may have selected from different
> attachments for the same host or it could have been attachments for
> multiple hosts all of which are capable of responding to the service
> request.
OK, converting this into other terms, as I think suit a core-edge
elimination scheme:
The originating host (assume for the moment a unicast host
= unique Identifier and unique Locator) may have selected from
different Locators for the various remote hosts which perform
the same function in a manner analogous to anycast: by having
those remote hosts all having the same Identifier ...
This requires multiple physical hosts to have the same identifier.
Is this kosher in any core-edge elimination scheme to date?
or it could have been attachments (Locators) for multiple hosts
(each host has a unique identifier, so this is not really
"anycast") all of which are capable of responding to the service
request.
The second half sounds like some mid-level stack arrangement. For
instance, the application and higher part of the stack in host A,
(assume unicast: a single host A with a single identifier A) is
communicating with what the stack and application treat as a single,
similar, (implicitly unicast) host B . . . but which is in fact
implemented at a mid level in the stack by choosing one of multiple
hosts with unique Locators and unique Identifiers, none of which are
"B" but which somehow work together so they behave as a host with
Identifier B.
That sounds like it would require additional complexity in all hosts,
including host A which never itself takes part in a set of "~anycast"
hosts, but may sometimes need to communicate with such a set,
presumably in a way which makes them resemble a single host as far as
applications are concerned.
I am sure it could be done - but it would involve more complexity in
all hosts.
>> The main, perhaps only, reasons people are interested in
>> conventional BGP-based host anycast are (AFAIK) as follows.
>> These are all dependent on the normal behaviour of the BGP
>> routing system.
>>
>> a - "Shortest" (generally, in BGP terms) path to the nearest
>> router which advertises the prefix of its anycast host.
>
> This is the "speed" case. The premise is that the closest one will
> have the shortest round trip time and the lowest probability of packet
> loss.
Ahh - yes, not just speed but lower packet loss.
>> b - Automatic failure recovery as long as the router stops
>> advertising the prefix if the one or more hosts using
>> this prefix dies. If so, the other BGP routers will
>> soon get the packets which would have gone to this one.
>
> Yes.
>
>> c - Load sharing over many hosts, in geographically and
>> topologically diverse sites, which gives the system
>> a high capacity and a great resistance to failure
>> without involving DNS in any way, since it always
>> responds to the one IP address. This is also extremely
>> important as a way of achieving high total bandwidth
>> to survive DoS attacks with floods of incoming packets.
>
> Yes. Point being that while DNS would work for this in theory, in
> practice it runs into trouble.
>
>> It may also be desired and possible to:
>>
>> d - Imply something about sending host location from which
>> of the anycast BGP routers got the packet (as you are
>> doing with your project which you mentioned at the end
>> of msg04894) - but AFAIK this information is not used for
>> the most prominent BGP-based host anycast usage: root
>> nameservers.
>
> I suppose that's possible though I'm not sure what the use case is.
OK - thanks for confirming this.
> It seems like much of the rest of what you wrote here is based on the
> faulty premise of having some kind of anycast ETRs.
I agree they are faulty - I tried pretty hard to find a use for them
and found none.
> The only response
> I can offer is: of course anycast ETRs don't make sense.
I agree.
> Anycast ITRs make sense.
Yes - Ivip Open ITRs in the DFZ or LISP Proxy Tunnel Routers.
> ITRs making anycast decisions while resolving the map make sense.
They might . . .
> But ETRs are supposed to be single points of attachment in the
> network topology; it wouldn't make any kind of sense for them to be
> anycast.
I haven't found a use for anycast ETRs yet - and I probably won't, so
I tend to agree. But we are designing these things and we can do
what we like with our own creations.
How would you achieve the goals agreed to above in LISP, APT or TRRP,
by having the ITR choose between multiple ETR addresses?
Current BGP-based anycast uses the natural behaviour of BGP routers
to cause the packet to be forwarded to the "closest" border router
which advertises the most specific matching prefix. BGP routers do
this naturally - it is not a special function they work hard to achieve.
Let's forget goal 'd' above for now - though it will be achieved fine
without the sending host needing to do anything, just by the
"anycast" or "disaster recovery" system knowing which site the packet
was received at.
Goal 'c', spreading the load over multiple servers, can easily be done
with LISP. This includes having the servers physically scattered at
different points in the Net. Simply have a number of ETR addresses,
and have the ITR choose randomly among them.
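As a minimal sketch of that ITR-side decision (the mapping structure and addresses here are invented for illustration, not LISP's actual mapping format):

```python
import random

# Hypothetical mapping: one EID prefix -> several candidate ETR addresses,
# one per server site.
MAPPING = {
    "10.0.0.0/24": ["45.68.23.1", "32.67.21.7", "21.62.92.7", "65.98.21.4"],
}

def choose_etr(eid_prefix):
    """Spread load by picking one of the mapped ETRs uniformly at random."""
    return random.choice(MAPPING[eid_prefix])
```

Over many ITRs and many flows this averages out to roughly equal load per site, with no coordination between ITRs.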
Still, the ITR has to somehow figure out which of these ETRs is
working and (if this is a separate thing, which it is not if the ETRs
are in the end-user networks, as LISP currently assumes) which such
ETRs can reach the destination host.
LISP has some ways of doing this, but I think they are messy and
expensive - which is why Ivip ITRs do not attempt to test
reachability to the ETR, or through the ETR to the destination network.
One way a LISP ITR might determine reachability is to test
whether the ETR address has an entry in the FIB. It will have an
entry if the ITR's RIB thinks it has a path to the BGP router (only
one such router, since we have no anycast ETRs) which the RIB thinks
is advertising the most specific prefix matching this ETR address.
That will solve some problems, but it doesn't establish that the ETR
is reachable.
(But what of an ITR which is not actually a router? It would send a
packet to the address in question and the packet may never get to any
border router, since the prefix may not be advertised at all in BGP.
Then, ideally, the ITR would get an ICMP destination host/network
unreachable message. Then it would choose another ETR instead - but
has it kept the packet? Probably not. How can it secure the ICMP
message to stop DoS attacks? The traffic packet needs to carry a
unique nonce in its LISP header, so there has to be a LISP header and
a UDP header in front of that - but the ICMP spec doesn't ensure
routers will actually send back enough of the original packet to
carry this nonce . . . it only needs to send back the UDP header,
IIRC, which precedes the LISP header.)
Goal 'b' would probably be achieved by a suitable solution to the
above problem - but I am yet to see it has been robustly and
efficiently solved.
How would a LISP, APT, or TRRP ITR achieve goal 'a'?
This depends on BGP "distance". If the ITR is a BGP router, in the
DFZ, then it will have the full DFZ routing table for its specific
point in the network, and it will have "distance" metrics for each of
its best paths. So the ITR part of the router could interrogate the
RIB and figure out which of the various ETR addresses in the mapping
has the BGP advertised prefix which is "closer".
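A sketch of what that RIB interrogation amounts to, with an invented toy RIB (real RIB access, and real BGP path selection, are of course far more involved):

```python
import ipaddress

# Toy RIB: prefix -> AS path of the current best route (invented data).
RIB = {
    "45.68.0.0/16": [701, 1239, 64512],   # 3 ASes to this ETR's prefix
    "32.67.21.0/24": [701, 64513],        # 2 ASes
}

def as_distance(addr):
    """AS-path length of the most specific RIB route covering addr."""
    ip = ipaddress.ip_address(addr)
    matches = [(ipaddress.ip_network(p), path) for p, path in RIB.items()
               if ip in ipaddress.ip_network(p)]
    if not matches:
        raise LookupError("no route to " + addr)
    # Longest-prefix match, then return that route's AS-path length.
    return len(max(matches, key=lambda m: m[0].prefixlen)[1])

def closest_etr(etr_addrs):
    """Pick the ETR whose covering prefix has the shortest AS path."""
    return min(etr_addrs, key=as_distance)
```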
That is costly, but would work, for an ITR which is a full DFZ
router, as would be PTRs.
However it won't work at all for ITRs which are any of:
Single homed BGP routers (not carrying the full DFZ table).
Internal routers (no idea of the DFZ table).
An ITR in a device which is not a router.
So I disagree with your proposal that it is possible to use multiple
ETR addresses in LISP, APT or TRRP to implement one of the crucial
goals which is routinely achieved, without any extra effort, in
conventional BGP-based anycast (which is admittedly unscalable).
How would I do this stuff - the equivalent of BGP-based anycast or a
similar function for the Disaster Recovery arrangement - with Ivip as
currently conceived?
Ivip can't do load balancing over multiple servers at different parts
of the Net if they are all on the same IP address. It is necessary
to use round-robin DNS for a given FQDN, so as to cause
Correspondent Hosts (CHs, as I will refer to the ordinary client
hosts) to choose at random one of the IP addresses and stick with it
for the session.
Those multiple IP addresses are all SPI addresses and each one is in
a separate micronet.
If the servers are in one location, the load balancing will be
between two or more ETRs in different ISPs, so the end-user network
changes the mapping of each micronet to one ETR or the other,
spreading the load accordingly. The same idea works fine if there
are multiple servers in different physical locations.
   Round-robin SPI        Separate
   addresses, each        ETR for each
   assumed to be in       server site
   its own micronet

   3.3.3.0                45.68.23.1
   3.3.3.1
   3.3.3.2                32.67.21.7
   3.3.3.3
   3.3.3.4                21.62.92.7
   3.3.3.5
   3.3.3.6                65.98.21.4
   3.3.3.7
With 800 active CHs, on average 100 choose a particular one of the
3.3.3.x SPI addresses which are returned by DNS.
Each server site has hosts which respond to every address.
Now by sending mapping changes, the end-user network can steer the
overall burden of work, in increments of ~1/8 of the total, among
its four sites. The most obvious arrangement is for the first two
micronets to be mapped to the first ETR address, the 3rd and 4th
micronets to the 2nd ETR, and so on.
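A sketch of that steering, under the assumed eight-micronet / four-ETR layout above (the Ivip mapping-change mechanism itself is elided; this only shows the bookkeeping):

```python
# Invented addresses matching the table above.
ETRS = ["45.68.23.1", "32.67.21.7", "21.62.92.7", "65.98.21.4"]
MICRONETS = ["3.3.3.%d" % i for i in range(8)]

def balanced_mapping():
    """Two consecutive micronets per ETR: each site carries ~2/8 of the load."""
    return {m: ETRS[i // 2] for i, m in enumerate(MICRONETS)}

def steer(mapping, micronet, new_etr):
    """A real-time mapping change for one micronet, e.g. on site failure
    or overload; returns the updated micronet -> ETR mapping."""
    updated = dict(mapping)
    updated[micronet] = new_etr
    return updated
```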
If a server site died, or became too busy, a real-time mapping change
could direct one or both of its micronets to another server site.
Assuming session-based protocols and the servers not sharing
session state, this switch would break sessions - but with an outage
or overload, some or many of the sessions were dead anyway.
Each site could be multihomed - and so could have two ETRs. Then it
is possible to load-balance the inbound traffic for each site too.
This looks like a good solution for goal 'c' - and with real-time
control by the network administrators and/or some automated load
balancing software it may well be more useful than the load balancing
which is possible with LISP, APT or TRRP.
Goal 'b' is achieved in the same way. The end-user network sends (or
has a probing company send on its behalf) a mapping change if one of
the sites goes down. The probing company would actually control the
mapping and as long as all sites were found to be reachable, it would
take instructions from the end-user network itself to achieve load
balancing. More on this in a new section: "The actual source of
mapping changes": http://www.firstpr.com.au/ip/ivip/Ivip-summary.pdf
But how would Ivip achieve something like goal 'a' - having the
packets sent to the "closest" operational ETR?
Ivip can't do this at present - except by using anycast ETRs, which
suck and have no scaling benefits over conventional BGP approaches.
But neither can LISP, APT or TRRP, unless perhaps the ITR is in the
DFZ - and most ITRs will be internal routers.
How could a host in a core-edge elimination scheme do all this?
I figure it could achieve 'c' and 'b' OK, but it has no access to the
BGP information it needs to do anything like 'a'.
So can we solve this problem by adding to all the schemes some kind
of network service (maybe always available from some *anycast*
address) by which an ITR or a core-edge elimination scheme host could
send one or more IP addresses and get back some relative metrics
about which one was "closer"?
If we could do this, then we could do Disaster Recovery with
generally optimal path-lengths (and therefore shortest times and
lowest packet loss rates) as part of the scalable routing solution in
a *scalable* fashion: that is without pushing more prefixes into the
DFZ and in a responsive manner.
I think this is well worth considering. It is a new idea to me, but
it could be done, I am sure.
Just have a bunch of "Distance Servers", or routers, or whatever,
scattered around the local network: one or a few in every ISP
network, or every end-user network, and/or every small section of
some big ISP or end-user network. Each one has its own global
unicast IP address too.
Each one also responds to some predetermined IP address in an
"anycast" fashion. (Alternatively, we need some autoconfig
arrangement for every ITR to find it.)
The Distance Server has been configured to communicate with the
nearest DFZ router. (Further work required if there are two DFZ
routers and the IGP steers some packets to one and others to another.)
Maybe the Distance Server simply pretends to be a BGP router and gets
a full, continually updated, copy of the DFZ router's routes,
complete with the lengths of all the routes it currently has.
The Distance Server could act in some ways like a BGP router and so
pass this on to other Distance Servers, in order not to load the DFZ
router.
Then when this Distance Server gets a query from an ITR about some
ETR address, it matches it to the most specific BGP route and returns
the "BGP distance", which is the number of ASes in that route's
announcement. A nonce in the query is returned in the reply in order
to secure the system. UDP queries and responses will be fine, so
multiple local Distance Servers can be anycast.
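A sketch of the query handling just described, with an invented route feed; a real Distance Server would take its routes from a BGP session with a nearby DFZ router, and the UDP transport is elided here:

```python
import ipaddress

# Invented DFZ feed: prefix -> number of ASes in the best route's AS path.
ROUTES = {
    "45.68.0.0/16": 4,
    "45.68.23.0/24": 3,   # more specific route, shorter AS path
    "32.67.0.0/16": 2,
}

def distance(addr):
    """AS count of the most specific route matching addr (None if no route)."""
    ip = ipaddress.ip_address(addr)
    matches = [(ipaddress.ip_network(p), n) for p, n in ROUTES.items()
               if ip in ipaddress.ip_network(p)]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]

def handle_query(nonce, addrs):
    """One query/response: echo the nonce (to secure the exchange) and
    return a relative "BGP distance" metric per queried address."""
    return {"nonce": nonce, "distances": {a: distance(a) for a in addrs}}
```

Because each exchange is a single stateless query and response, any of several anycast Distance Servers can answer it.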
ITRs (or hosts in the case of core-edge elimination systems) would
query Distance Servers with one or more ETR addresses and so be able
to choose which ETR address was apparently "closer".
This would work fine for any system in which a single EID's mapping
included multiple ETR addresses: LISP, APT and TRRP.
But what of Ivip, where there is currently a single ETR address for
every micronet?
The most obvious approach is to extend Ivip's mapping to an arbitrary
number of ETR addresses. That could be done. Another approach would
be to retain a single "ETR" address, but to make a certain range of
addresses have a special meaning. These would be outside the global
unicast address range, and so could never actually be ETR addresses.
They would index into a separate set of mapping information and
return multiple ETR addresses.
Either way, Ivip could be modified to have multiple ETR addresses for
each micronet.
Ivip ITRs are already in communication with at least one local full
database Query Server (QSD) perhaps via one or more optional caching
query servers (QSCs), which I will ignore for now.
The QSD can incorporate the functions of the Distance Server.
So the QSD, directly or indirectly, gets DFZ route "distances" for
whichever ISP or end-user network it is located within.
As before, the ITR has a packet to send which it doesn't know the
mapping of. It sends the destination address with a nonce to its
local QSD. As before, if the QSD's copy of the mapping database for
this micronet contains an ordinary ETR address, this is what is sent
back to the ITR, along with the starting and ending address of the
micronet and a caching time. (Within the caching time, the QSD will
tell the ITR if the mapping for any of this micronet changes, to a
new ETR address or if the micronet is split into multiple smaller
micronets.)
If the QSD finds an ETR address in the special range, then it looks
up some other part of its mapping database to find the one or more
real ETR addresses which correspond to this special ETR address.
Then, via looking into its Distance Server sub-function, it
determines which of these is "closest". Then it sends that ETR
address to the ITR.
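A sketch of that QSD lookup path, with invented tables; I have assumed 240/4 (reserved, outside global unicast space) as the special range:

```python
SPECIAL_PREFIX = "240."   # assumption: addresses here are never real ETRs

# Hypothetical QSD tables.
MICRONET_MAP = {
    "3.3.3.0": "240.0.0.1",    # special: indexes a multi-ETR entry
    "3.3.4.0": "45.68.23.1",   # ordinary: a single real ETR address
}
MULTI_ETR = {"240.0.0.1": ["45.68.23.1", "32.67.21.7"]}
AS_DISTANCE = {"45.68.23.1": 4, "32.67.21.7": 2}  # from Distance sub-function

def qsd_lookup(micronet):
    """Return the single ETR address the ITR should tunnel to."""
    etr = MICRONET_MAP[micronet]
    if etr.startswith(SPECIAL_PREFIX):
        # Resolve the special address to real ETRs and pick the closest.
        etr = min(MULTI_ETR[etr], key=AS_DISTANCE.__getitem__)
    return etr
```

The ITR only ever sees a single ETR address, so the ITR and the ITR<-->QSD protocol are unchanged; all the new complexity sits in the QSD.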
This involves no extra complexity in the ITR, but adds some to the
QSD. This sounds like a good division of labor. It also is in
keeping with APT's Default Mapper (APT's equivalent of the QSD) which
makes decisions for ITRs based on multiple ETR addresses, and only
sends the answer to the ITR.
This looks like a perfectly good way of enabling Ivip to achieve goal
'a' of the anycast / Disaster Recovery goals. It involves two extra
sets of complexity and storage in the QSD, but that is fine.
APT could do the same thing - integrate "Distance Server" functions
into the Default Mapper and thereby support mapping options
instructing the DM to choose the "closest" ETR.
LISP-ALT and TRRP have no local query server. So they would need a
new network element: a Distance Server.
The "Distance Server" could be integrated into LISP-ALT's Map
Resolver function.
- Robin
http://www.ietf.org/mail-archive/web/rrg/current/msg04901.html
> You have two sites: primary in New York (NY), disaster recovery in
> Tokyo (T). Each site has one Internet link, one router and one
> server.
>
> Sprint provides the Internet link to NY. Cogent provides the
> Internet link to T.
>
> The NY router announces 1.2.3.0/24 via BGP at normal priority.
>
> The T router announces 1.2.3.0/24 after prepending the AS# 3 times,
> causing most of the other routers on the Internet to consider the T
> router more distant than the NY router.
>
> The two sites are normally linked together with a VPN traveling
> over the Internet using 5.6.7.8 on one side and 8.7.6.5 on the
> other, addresses provided by the respective ISPs. This functions as
> if there was a private line directly connected between the two
> routers.
>
> Ordinarily, the T router sends packets for 1.2.3.4 over the VPN to
> the NY router while the NY router sends packets for 1.2.3.4 to the
> NY server.
>
> If the VPN fails, the T router will send packets for 1.2.3.4 to the
> T server instead.
>
> If the NY server fails, the NY router will send packets for 1.2.3.4
> over the VPN to the T router and the T router will send packets for
> 1.2.3.4 to the T server.
>
> Both servers answer to 1.2.3.4. Both return identical content.
[ Sprint 1.2.3.4
[ R2------R5----------[RNY]------------(SNY)
[ / \ | ! 5.6.7.8
[ SH1--->R1 R4---R6 !
[ \ / | ! VPN (really via R5, R6, R4, R3,
[ R3------R7 ! R9, R10)
| !
| 1.2.3.4 ! 8.7.6.5
{ R9------R10---------[RT]-------------(ST)
{ / \ /
{ SH2----R12--R13-/
{ \ /
{ Cogent R14
_______________________________________________
rrg mailing list
[email protected]
http://www.irtf.org/mailman/listinfo/rrg