Re: Bug in routing logic?

Christopher W. Curtis Fri, 14 Apr 2000 11:06:14 -0700
"Dr. Michael Weller" wrote:
> 
> On Wed, 5 Apr 2000, Christopher W. Curtis wrote:
> 
> > Well, this isn't really true.  I used to have both gateways installed as
> > a metric 1, and when one went down, the other would take over.  I could
> > tell the difference by doing a tracepath and seeing different gateways
> > being used, without any change in my routing tables.
> 
> Hmm, I have trouble believing that.  Maybe if both gateways were on
> different nets and Linux noticed that the corresponding network card went
> down?

I don't know what to tell you.  (Sorry for the late reply, btw.)  I have
one default gateway.  Since Linux chooses the first default gateway (gw1)
for routing, I add the new one (gw2) and delete the first.  A tracepath
still uses gw1, even though it is no longer in my routing table.  This
appears to be a cache problem, though, as tracepath'ing a different site
uses gw2.  I then add the first gateway back and tracepath various
sites; all use gw2, which is the first entry in the routing table listed
by 'route'.

I then disconnect gw2 from the net and do a tracepath.  I cannot access
the network at all.  However, if I tracepath a different site, I can get
through.  Here is my log [abbreviated]:

[254 is spanky; 252 is styx]

default         styx.aet-usa.co 0.0.0.0         UG    0      0        0 eth0
default         aet-usa.com     0.0.0.0         UG    0      0        0 eth0
eclipse:~# tracepath ee.fit.edu
 1?: [LOCALHOST]      pmtu 1500
 1:  192.168.68.252     1ms 

<disconnect>

eclipse:~# tracepath ee.fit.edu
 1?: [LOCALHOST]      pmtu 1500
 1:  no reply

default         styx.aet-usa.co 0.0.0.0         UG    0      0        0 eth0
default         aet-usa.com     0.0.0.0         UG    0      0        0 eth0
eclipse:~# tracepath slashdot.org  
 1?: [LOCALHOST]      pmtu 1500
 1:  192.168.68.254     3ms 

I've omitted only irrelevant output.  I can send you the dump of my
xterm if you wish, but you are welcome to try this yourself: it works
repeatably under 2.2.13 on i386, as long as both routes have a metric of
0.  Making spanky a metric 1 breaks this (very desirable) functionality.
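The log can be summarised with a toy model: a destination seen for the first time gets the first default route whose gateway is not known to be dead, and that choice is then cached per destination and reused. This is an assumption-laden illustration, not the 2.2 kernel's route cache; the class name and the `dead` set are hypothetical, and how Linux actually learns that a gateway has stopped answering is not modelled.

```python
from typing import Dict, List, Optional, Set


class ToyRouteCache:
    """Hypothetical model of the log above, not the kernel's code:
    a new destination gets the first default route whose gateway is not
    known to be dead, and that choice is cached per destination."""

    def __init__(self, gateways: List[str]) -> None:
        self.gateways = gateways          # default routes, in 'route' listing order
        self.dead: Set[str] = set()       # gateways assumed to have stopped answering
        self.cache: Dict[str, str] = {}   # destination -> gateway chosen earlier

    def lookup(self, dest: str) -> Optional[str]:
        if dest in self.cache:
            # Cached entries are reused even if their gateway has since died.
            return self.cache[dest]
        for gw in self.gateways:
            if gw not in self.dead:
                self.cache[dest] = gw
                return gw
        return None                       # no usable default route


table = ToyRouteCache(["192.168.68.252", "192.168.68.254"])  # styx, spanky
print(table.lookup("ee.fit.edu"))    # 192.168.68.252
table.dead.add("192.168.68.252")     # <disconnect>
print(table.lookup("ee.fit.edu"))    # 192.168.68.252 (stale entry)
print(table.lookup("slashdot.org"))  # 192.168.68.254
```

Replaying the log this way, the stale cache entry explains why the second tracepath to ee.fit.edu got no reply while slashdot.org went out through spanky.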

> > Now, this works for me, but the problem is that which one Linux chooses
> > is determined by order of definition (minor nit), and then there is no
> > way to move back to the other gateway when I bring it back up, except to
> > delete the stable one and then re-add it.  I thought that I could do this
> > by specifying a metric.
> 
> This sounds more sensible.  So you say Linux always uses the last
> setting?  Maybe it should silently remove the overridden one then, or just
> refuse to add the new route (to tell the admin he is doing something which
> won't work).

It actually chooses the first setting if it can.  If not, it'll fall
through to the next gateway, provided that gateway is of the same
metric.  I haven't examined the code ...
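The metric restriction described here can be stated as two selection rules: the one observed under 2.2.13 (fall through only among default routes sharing the lowest metric) and the one this thread is asking for (move up to a higher metric when every cheaper gateway is dead). Both functions are hypothetical sketches of the described behaviour, with gateway liveness assumed to be known; they are not taken from the kernel source.

```python
from typing import List, Optional, Set, Tuple

Route = Tuple[str, int]  # (gateway, metric), in the order 'route' lists them


def pick_observed(routes: List[Route], dead: Set[str]) -> Optional[str]:
    """Observed behaviour as described above: fall through only among
    routes that share the lowest metric."""
    if not routes:
        return None
    best = min(metric for _, metric in routes)
    for gw, metric in routes:
        if metric == best and gw not in dead:
            return gw
    return None  # routes with a higher metric are never tried


def pick_desired(routes: List[Route], dead: Set[str]) -> Optional[str]:
    """Hypothetical wished-for behaviour: take the cheapest live gateway,
    moving up to a higher metric only when every cheaper one is dead."""
    # sorted() is stable, so routes with equal metrics keep their listing order.
    for gw, _ in sorted(routes, key=lambda r: r[1]):
        if gw not in dead:
            return gw
    return None


routes = [("styx", 0), ("spanky", 1)]
print(pick_observed(routes, {"styx"}))  # None
print(pick_desired(routes, {"styx"}))   # spanky
```

With both routes at metric 0, `pick_observed` falls through to spanky, matching the rollover in the log; making spanky a metric 1 route leaves it with nothing to fall back to.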

> > I don't know much about TCP/IP, but I know that TCP reports 'Host
> > Unreachable' messages, and I would think that using these it could use a
> 
> Who would generate these messages? When the gateway is on the same

True; however, Linux can tell if a host doesn't respond.  Again, I don't
know the mechanics, but since the default route names a target host,
Linux should address the packet to that host for forwarding, so I would
think the sender would expect some ICMP 'success' type message, or some
sort of 'ACK', in return.  I realize that dropped packets are resent at
the request of the destination host, so there seems to be no need for a
gateway to return success messages, but Linux somehow already knows when
a default route is gone, perhaps by attempting to send (my initial
tracepath) and then realizing that the packet never got anywhere.  Those
initial routes are still stuck in the cache, but perhaps the host is
marked 'not reachable' and passed over when new entries are added to the
cache ...

Again, I haven't looked at the code, but this seems a reasonable
explanation of how this mechanism may be working (and it does work,
btw) ...

> > next route with a higher metric, since it already will choose a
> > different route with the same metric.  I haven't thought it out much,
> > but if a route was marked unreachable, it wouldn't seem unreasonable
> > for Linux to poll this route intermittently (once a minute?) to see if
> > a lower-metric route that went awry came back.  (Perhaps route is wrong,
> > but host and gateway should be able to work like this.)
> 
> Well, what you describe is for the most part exactly what a routing
> protocol does.  If you need this dynamic behaviour, set up a routing
> protocol between the necessary gateways.  For your situation a specific
> script sending some pings and setting up routes every few seconds might
> just do it as well.

Good idea - when I get back to that project I may do just this.  I'm
going to try 2.3.99-pre3 first and see if I can prevent the crashes, or
at least get a better handle on what's going wrong ...
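A minimal sketch of the ping-and-reroute script suggested above, assuming net-tools `route` and an iputils-style `ping` that accepts `-c` (count) and `-W` (timeout in seconds) are on the PATH and that it runs as root. All function names are hypothetical, and the script assumes it is the only thing managing default routes. Because it probes the preferred gateway on every pass, it also falls back once that gateway answers again.

```python
import subprocess
import time
from typing import Callable, List, Optional


def gateway_alive(gw: str) -> bool:
    """One ICMP echo with a 1 s timeout; assumes an iputils-style 'ping'
    that accepts -c (count) and -W (timeout in seconds)."""
    result = subprocess.run(["ping", "-c", "1", "-W", "1", gw],
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def first_alive(preferred: List[str], alive: Callable[[str], bool]) -> Optional[str]:
    """Return the most preferred gateway that answers, or None."""
    for gw in preferred:
        if alive(gw):
            return gw
    return None


def switch_commands(old: Optional[str], new: Optional[str]) -> List[List[str]]:
    """net-tools 'route' commands that move the default route from old to new."""
    if new is None or new == old:
        return []                    # nothing reachable, or nothing to change
    cmds = [["route", "add", "default", "gw", new]]
    if old is not None:
        cmds.append(["route", "del", "default", "gw", old])
    return cmds


def watch(preferred: List[str], interval: float = 5.0) -> None:
    """Poll forever (needs root). Assumes this script owns the default
    route, i.e. no other default routes are configured."""
    current: Optional[str] = None
    while True:
        new = first_alive(preferred, gateway_alive)
        for cmd in switch_commands(current, new):
            subprocess.run(cmd, check=False)
        if new is not None:
            current = new
        time.sleep(interval)
```

It could be started with, for example, `watch(["192.168.68.252", "192.168.68.254"])` to prefer styx and use spanky as the backup. As the log shows, destinations already in the route cache may keep using the old gateway after a switch; with iproute2 installed, `ip route flush cache` is one way to clear them.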

> change the route only if the active router changes).  BTW, of course any
> data coming back to you needs to take the other route too (or is this
> some fancy masquerading setup?), so this can normally only work if all
> routing hosts have some routing protocol running and know about each
You kinda lost me here.  It is a masquerading setup, but I don't think
it's all that fancy.  I just want the existing failover to graduate up
to the metric 1 gateway, and fall back to the metric 0 gateway if it
becomes available again.  The latter may be difficult to determine; I'd
be happy if it just went to the metric 1 gateway, and re-adding the
previously broken route at metric 0 then caused it to choose the less
expensive gateway automatically.
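Deciding when the metric 0 gateway counts as "available again" is the hard part noted here. One common approach, assumed here and not something proposed in the thread, is a hold-down: promote the backup as soon as the primary misses a probe, but fall back only after the primary has answered several probes in a row, so a flapping link does not keep bouncing the default route. The class and its `recover_after` parameter are hypothetical; the probe results would come from something like a periodic ping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Failover:
    """Hypothetical primary/backup selector with a simple hold-down:
    switch to the backup as soon as the primary misses a probe, but only
    fall back once the primary has answered `recover_after` probes in a row."""
    primary: str
    backup: str
    recover_after: int = 3
    active: Optional[str] = None
    _streak: int = 0  # consecutive successful probes of the primary

    def update(self, primary_ok: bool) -> str:
        if self.active is None:
            self.active = self.primary if primary_ok else self.backup
        self._streak = self._streak + 1 if primary_ok else 0
        if self.active == self.primary and not primary_ok:
            self.active = self.backup            # graduate up to the metric 1 gateway
        elif self.active == self.backup and self._streak >= self.recover_after:
            self.active = self.primary           # metric 0 gateway looks stable again
        return self.active


f = Failover("styx", "spanky")
print([f.update(ok) for ok in [True, False, True, True, True]])
# ['styx', 'spanky', 'spanky', 'spanky', 'styx']
```

The backup's own liveness is not checked in this sketch; a fuller version would also handle both gateways being down.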

> However, there is no reason to put this check-the-route-and-modify-settings
> process in the kernel.  A user-level tool can do it easily, and some
> already exist.

I'll partially agree.  Some of the code already seems to be in place; it
just doesn't choose a higher metric when it detects a failover.  The
problem now is that I have to delete both routes and add them back in the
proper order to restart testing.  I'd rather not have to do this; if it's
a simple modification (I may even look myself), it would be 'better' if I
just had to re-add a single route.

Christopher
-
To unsubscribe from this list: send the line "unsubscribe linux-net" in
the body of a message to [EMAIL PROTECTED]