Hi Bill,

Thanks for your explanation:

  http://psg.com/lists/rrg/2008/msg00478.html

which helped me understand:

  http://bill.herrin.us/network/trrp-aapip.html

I now recognise that the TRRP Waypoint Router system bears no
resemblance to LISP-ALT.

You wrote:

> A waypoint is a combination ETR/ITR. It can accept packets with
> any encapsulation that it advertises

I am not sure the Waypoint Router (WR) advertises anything - except
perhaps as noted later where it advertises its prefix in BGP. It has
an entry in the v4.trrp.arpa DNS structure (a domain name and a record
which can be returned by querying that domain name) which tells ITRs
that it accepts packets for a given prefix, with various parameters
for what guarantees it makes about delivering them, and for how long
the ITR should send "initial" packets to it.

> and will if necessary re-encapsulate them in a manner that the
> next waypoint or final destination accepts.

I understand that the initial ITR treats the WR as it would an ETR -
it tunnels the traffic packets to it. The WR typically tunnels the
traffic packet either to another WR or to an ETR (perhaps one device
could play both roles).

> Because there are perhaps a couple hundred top-level waypoints

I assume IPv4 in this discussion and that they would be for:

  1.0.0.0/8
  2.0.0.0/8
  ....
  223.0.0.0/8

not counting 10, 127 etc.

> (compared to millions of final destination ETRs) and because
> waypoint maps have a long timeout it's highly probable that the
> ITR has already cached a valid waypoint.

OK. Any ITR which has been running for a while will already have
found out about the /8 WRs for the /8s it has been tunnelling packets
for. WRs have very stable IP addresses and the ITR caches their
details for a long time. This could cause trouble when you do move a
WR. I guess you would keep one running on the old address for a week
or whatever after it was last mentioned in the v4.trrp.arpa DNS.

> If it hasn't, the ITR has a search algorithm that guarantees it
> quickly will.

Yes - this happens the first time the ITR handles a packet addressed
to a /8 it has not yet had a packet for, but this will be rare and
will generally take 0.4 seconds or less.

>> Can you give a more concrete example of how these Waypoint
>> Routers would be structured?
>
> Sure. Lets say we have a source IP at 126.0.0.1 and he want's to
> talk to me in the swamp at 199.33.224.1. So, he sends the packet
> out. Call it packet A. There is no BGP route that covers
> 199.33.224.1,

Except perhaps as noted later regarding WRs advertising their
prefixes in BGP.

> so packet A follows 0.0.0.0/0 to the nearest TRRP ITR.

I assume this is an IGP default route, so the ITR is located inside
some end-user or ISP network. If you have such ITRs advertising to
the routers of other ASes in the DFZ, then you will be supporting
traffic from non-upgraded networks, just as with Ivip's "anycast ITRs
in the core/DFZ" or LISP's "Proxy Tunnel Routers".

> The ITR doesn't yet have a map for 199.33.224.1. However, he does
> have a waypoint map for 199.0.0.0/8: apparently the US
> Government has decided to be nice and offer an "initial" mode GRE
> waypoint as a public service at 148.129.75.8, which is within
> globally routable (BGP routed) space.

OK. The ITR would have discovered this WR at 148.129.75.8, and some
information about it, via a record it was sent after a query about
the domain name:

  8.waypoint.199.v4.trrp.arpa

which returned something like "00,wp,limited=5 80,g4,148.129.75.8".

> So, the ITR immediately encapsulates packet A in GRE and sends it
> to 148.129.75.8. Then it initiates a lookup for
> 1.224.33.199.v4.trrp.arpa so that it'll be able to send
> subsequent packets directly.

OK.

> 148.129.75.8 doesn't have a MAP for 199.33.224.1 either. However,
> I have a private waypoint set up for all of 199.33.224.0/23 in
> "generous" mode at 71.246.241.146 (which is also within globally
> routeable space). It accepts GRE as one if its formats.
> I have made arrangements with 148.129.75.8 to keep this knowledge
> in his cache. Essentially, I push this knowledge to him.

My first critique is about security. How can 148.129.75.8 know that
your WR 71.246.241.146 is authorised by you, the person to whom these
512 IP addresses (199.33.224.0/23) of TRRP-mapped address space have
been in some way assigned?

Without some fancy security, an attacker could pretend their WR is
for your /23. It would be OK if the address you push to 148.129.75.8
is the same as one of the ETR addresses mentioned in at least one of
your micronets within that range (as it happens to be in this
example), because this is an independent way 148.129.75.8 can ensure
that you control, or authorise the use of, whatever is at this
address 71.246.241.146.

Maybe the solution is to have your WR advertised in the DNS. Then,
you either push this information to the /8 WR or allow it to trawl
through the DNS to find all the WRs for each subdomain (subset of the
/8 space) it finds. But then it needs to be told if you move the /23
WR - so push is probably the best approach, with the /8 WR verifying
the pushed information by checking it matches what it finds in the
DNS when it queries:

  24.waypoint.0.224.33.199.v4.trrp.arpa

The ITR sends the initial packets to the /8 WR, because it hasn't
cached this /23 WR information. It doesn't query the DNS to see if
there is a /23 WR, because that would take as long as asking for the
ETR address.

> So, 148.129.75.8 keeps packet A in GRE and sends it on to my
> waypoint at 71.246.241.146. He doesn't bother trying to look up
> 1.224.33.199.v4.trrp.arpa because I've told him that my waypoint
> operates in "generous" mode.

OK - your WR will accept all packets for your /23, with the idea that
ITRs only send them while they are awaiting a response to their
mapping request.

> My waypoint at 71.246.241.146 receives packet A.
> He's directly attached to an authoritative DNS server that knows
> the current TRRP map for 1.224.33.199.v4.trrp.arpa and probably
> already has it in his cache. In this case, it happens to be directly
> available (the waypoint was also a final ETR) so 71.246.241.146
> decapsulates packet A and delivers it to 199.33.224.1.

So far so good.

> Around the same time, the DNS request for
> 1.224.33.199.v4.trrp.arpa from the original ITR reaches the
> authoritative DNS server and the reply starts making its way
> back.

When the response arrives, the ITR doesn't send any more packets via
148.129.75.8, but uses the ETR address it got from the response -
actually it makes a choice from typically multiple ETR addresses.

But what if your /23 is actually split into multiple micronets, and
some of their ETRs are nowhere near your WR? The system would still
work, but could involve longer paths.

> Note that US Government could have been replaced with Money
> Grubbing Company and "initial" mode could have been replaced with
> "limited" mode. The differences would have been:
>
> 1. The first ITR would have held a copy of packet A and would
> have sent a second copy once the map lookup for
> 1.224.33.199.v4.trrp.arpa succeeded.
>
> 2. If I didn't pay MGC to pass my packets to me, he would have
> dropped packet A instead of sending it on to my waypoint. I'd
> have had to wait for the ITR's second copy.

Which the ITR would tunnel directly to your ETR as soon as the ITR
receives the mapping response.

> Note also that 148.129.75.8 would likely announce 199.0.0.0/8
> into the BGP table so that networks without TRRP ITRs could find
> their way to my TRRP ETR.

OK - to the extent that this is true and workable, TRRP uses a
similar technique to Ivip's "anycast ITRs in the core/DFZ" and LISP's
"Proxy Tunnel Routers" to attract and tunnel packets sent from
networks without ITRs.

So far, you have mentioned only a single WR.
All you have to do is tell me that these WRs must be, or should
typically be, anycast, so multiple such WRs doing the same job are
scattered around the Net, advertising the same prefix - and give me a
name for this technique (maybe "Anycast Waypoint Routers in the DFZ")
and I would say that TRRP is potentially incrementally deployable, at
least in this important respect, with ways of ensuring relatively
short paths and good load sharing between these WRs.

It would be somewhat different for TRRP to have one set of ITRs
around the Net anycasting one or more /8s and another set anycasting
another one or more /8s, but it is close enough to the basic Ivip
concept of each such ITR advertising all Ivip MABs for it to be
regarded as a close cousin.

I am not sure exactly how Proxy Tunnel Routers would be organised -
each one advertising every prefix which encompasses space mapped by
LISP, or some covering part of the space and others covering other
parts.

Ivip already has the concept of load sharing between multiple ITRs by
each one only advertising a fraction of the MABs (Mapped Address
Blocks), so your approach, if implemented with anycast, is the same
as one way Ivip could be used.

In principle, with multiple RUASes each handling a subset of MABs,
one could imagine each RUAS establishing its own set of anycast ITRs
in the DFZ, each set only advertising that RUAS's micronets. It could
start this way, as RUASes competed to support their space with well
placed, high capacity ITRs. In time, it would probably make sense for
them to form a consortium to run a single set of anycast ITRs around
the Net, or to subcontract their ITR needs to ITRs-R-Us LLC who would
have multiple sites and ITRs there which advertise the MABs of all
its client RUASes.

> How exactly we do this is an open
> issue, one of the unfinished things about the document... Any
> network which -does- have a TRRP ITR shouldn't insert that route
> into its FIB, or should locally override it from the ITR.
> How does it know to do so? It's my "holey routes" problem wrought
> large.

It is a pretty easy problem to solve if you only have one WR for each
/8. Whether they are anycast or not, and whether or not you have
multiple IP addresses for each /8 WR (so the ITR can choose between
multiple physically separate WRs), as long as you only have a few
hundred of them and they are all handling /8s, then you can afford to
have a relatively static configuration item in every TRRP ITR:

  Ignore any /8 advertisements on the following list of /8 prefixes
  for the purpose of deciding whether there is a normal BGP route to
  those prefixes. Those are all actually routes to WRs, which we may
  choose to use for initial packets.

However, this arrangement of a few hundred /8 WRs looks to me like it
won't scale.

My second critique is that you really can't have a single WR for each
/8, for a number of reasons.

1 - Too much load on a single WR. Fix with multiple machines at the
    one site, or by using anycast. Also fix with multiple WR
    addresses for the one /8, so that the ITR can choose any one and
    so spread the load.

    You could also split the system up into a much larger number of
    WRs, each for a longer prefix. However, that makes it less likely
    each ITR will have the WR it needs already in its cache. (ITRs
    could scan the DNS to a certain depth to find all WRs, say to
    /16 - and cache them all. However, you need to keep them stable,
    or have the ITRs periodically scan the DNS quite often.)

2 - Single point of failure. Solvable with anycast and with multiple
    WR addresses - but how could an ITR know it was sending packets
    to a WR which was working, if it just got the address from DNS
    some time ago? One of those going down would blackhole some
    subset of the initial packets to this /8.

3 - Long paths. The ITR is in Shanghai and so is the ETR it would
    choose to tunnel the packets to, but the ITR doesn't know this
    yet because the mapping response hasn't arrived.
    (Maybe the nameservers haven't been fully anycast as I suggested
    in my message 488.) The /8 WR is in Washington DC, so the initial
    packets have to cross back and forth across the Pacific and the
    USA. This is slow and inefficient.

    Solution 1: Anycast the WRs widely.

    Solution 2: If anycast is not used, supply a number of WR
    addresses and somehow enable the ITR to figure out which is
    closest.

    Long paths mean long delays, larger costs of carrying traffic
    which really should just be going from one place to another in
    Shanghai, and a greater chance of the packet being dropped.

4 - A /8 WR may need to know an excessive number of ETRs or other WRs
    to send packets to. The /8 could handle the TRRP mapped space for
    hundreds of thousands of end-users, and each one may have two or
    more micronets, which are currently mapped to ETRs in very
    different locations. Then, each such micronet will need its own
    WR, unless you are prepared to live with overly long paths for
    initial packets.

    This is probably OK, but it requires some really snappy
    administration to make it secure. Maybe the DNS checking system I
    mentioned is what you had in mind.

Maybe the WR system could be defined somewhat differently, sticking
with the original concept of a single WR for a whole /8.

We assume that the entire mapping database is too big for any router
to cache - the whole basis of TRRP's or ALT's appeal is that it
scales endlessly. However, if there is a /8 WR somewhere which gets
all the early packets from every ITR in the world, then that can
operate simply as a super ITR, and will have already cached the
mapping for pretty much any mapped address which has recently been
receiving traffic.

Whether it is /8 or /12 or /16, you have a bunch of these things,
called WRs, and each is just a fast, large-cache ITR. It doesn't need
the concept of end-user specific WRs - it just looks up and caches
the mapping like any other ITR.
The advantage is that it will generally already have the mapping in
cache, so it will be able to tunnel the packet to the ETR
immediately. However, you really need lots of these around the Net,
anycast. Then, the anycasting dilutes the traffic seen by each one,
which is good in one way, but increases the chance that each one
won't have cached the mapping for the micronet when it needs it.

To soup up the WR system as I understand you described it - 200 or so
/8 WRs - I think the most obvious thing to do is to anycast them and
to split them up to handle longer prefixes each (less address space).

A suitably anycasted system could have a complete set of WRs at 36
locations around the Net, as I listed in message 488. They might as
well be in the same locations as you have the complete set of anycast
nameservers for the trrp.arpa domain and subdomains.

Now, there's no real reason for an ITR to send both a mapping request
and the initial packet as separate things, because both packets would
go to the same anycast site. You could simply have the ITR tunnel the
initial packet to some anycast address, knowing it will find its way
to your nearest anycast site, and be tunnelled quickly to your chosen
WR (which you chose the location of to be close to, or at, your ETR
site). The source address of the outer header tells your anycast
server the ITR's address and the destination address of the inner
packet tells it what IP address the ITR needs the mapping for.

This is just like APT's Default Mapper, but is anycast to a few dozen
sites around the Net. This is a hybrid push-pull global network of a
few dozen Default Mappers. You are pushing the entire database, plus
information about WRs, to these few dozen sites. The more sites you
have the greater the load sharing and the lower the delay time in
getting mapping to the ITRs. Likewise the shorter the extra path
length travelled by initial packets.
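
To make the "initial packet doubles as mapping request" idea concrete,
here is a minimal Python sketch of what an anycast site could derive
from a tunnelled packet. It is only an illustration: the function and
type names are hypothetical, and the name format is simply inferred
from the 1.224.33.199.v4.trrp.arpa example in this thread, not taken
from the TRRP document.

```python
import ipaddress
from typing import NamedTuple

# Zone name as it appears in this thread's examples.
TRRP_ZONE = "v4.trrp.arpa"


class MappingRequest(NamedTuple):
    """What the anycast site can infer from one tunnelled packet."""
    reply_to: str    # ITR address, taken from the outer header source
    query_name: str  # DNS name to look up for the inner destination


def lookup_name(addr: str) -> str:
    """Reverse the octets and append the zone (hypothetical helper).

    Mirrors the thread's example: 199.33.224.1 becomes
    1.224.33.199.v4.trrp.arpa.
    """
    ip = ipaddress.IPv4Address(addr)  # raises ValueError on bad input
    octets = str(ip).split(".")
    return ".".join(reversed(octets)) + "." + TRRP_ZONE


def implied_request(outer_src: str, inner_dst: str) -> MappingRequest:
    """Treat a tunnelled initial packet as an implicit mapping request."""
    ipaddress.IPv4Address(outer_src)  # validate the ITR address
    return MappingRequest(reply_to=outer_src,
                          query_name=lookup_name(inner_dst))
```

For example, implied_request("192.0.2.7", "199.33.224.1"), with
192.0.2.7 standing in for an ITR address, gives a reply_to of
192.0.2.7 and a query_name of 1.224.33.199.v4.trrp.arpa, so the site
needs no separate request packet from the ITR.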

It would overcome many of the delay and bottleneck problems of TRRP
as you currently propose it:

  Anycast DNS servers authoritative for each /8 of the address space,
  but probably separate sets of anycast servers at different sites
  depending on which RIR or whatever is running these DNS servers.

  DNS servers for the /24 prefixes are far more numerous and run by
  far more organisations than the few RIRs etc. So presumably this
  means they are not anycast, and that delay times for this second
  request will often be pretty long. Likewise, worse risk of packet
  loss.

  220 or so sites, each for a WR for an entire /8. Likewise scaling
  problems, long path problems etc. meaning more delays.

>> To what extent does your system resemble ALT, and to what
>> extent does my critique of ALT apply to your system?
>
> Without having reviewed ALT in any detail, my best guess is that
> it doesn't share much besides the notion of a highly aggregated
> alternate path.

OK. ALT automatically passes the packet up and down a hierarchy,
which is strongly aggregated with the unfortunate result that each
such router could be almost anywhere, so the overall path length for
the whole ITR to ETR trip could be very long indeed.

By "highly aggregated" in TRRP, you mean the /8 WR already knows the
exact IP addresses to send packets to for every micronet in its /8.

> ALT sounds like it might work with static tunnels or private
> lines using something close to standard BGP. On the down side, it
> would require a complex dance between thousands of operators to
> get it going.

Yes.

> With Waypoints, the complex dance is at the RIR
> public policy level getting authorization to announce for that
> supernet. Once authority is obtained, they hook up at any old
> place and announce the prefix.

OK - it is very direct - /8 WR to /XX WR, for each micronet or for
whatever subset of the space the end-user chooses. I assume that
micronets for IPv4 can be as small as a single IP address.
Will the RIRs be happy to run a WR which has theoretically up to 2^24
separate destination WRs?

What about traffic volume levels? Right now, RIRs probably charge you
and Google the same for X amount of address space. But if Google
really gives their WRs a hammering, the RIRs should be charging
Google according to their higher traffic volume.

>> By your own description, the Waypoint Router path is "long" -
>> compared to going direct in a tunnel to the ETR (the address of
>> which is not known at this time). Presumably this "long" path
>> will be faster than waiting for the mapping information to
>> arrive.
>>
>> Do you have estimates for the delay times?
>
> My SWAG is that the initial round trip will be 1.5 to 2 times the
> normal round trip with some single-digit percentage taking long
> enough to recognize no gain versus bare TRRP.

I don't clearly understand this. If you have a single WR for each /8,
then some ITRs are going to be on the other side of the Earth with
respect to it, and so are the ETRs they are trying to send packets
to. Worst case delay times could be long unless you have an elaborate
anycast network.

Regards

  - Robin
