Leland,

I'm guessing that the SSH session that freezes originates somewhere
outside the 10.14.3.0/24 network (more on that below).

I have found that although the QETH driver treats the uncoupling of a NIC
as a 'cable-pull' event, the interface is often still marked 'UP' (this is
definitely the case when a simulated NIC is uncoupled from a Guest LAN).
Once you have disconnected the NIC, check what state the interface is in
and what the route table shows.

My guess is that the route to 10.14.3.0/24 via eth0 is still at the top of
the table, sending your reply packets into a dead network.  This will stop
any traffic to 10.14.3.0/24 from reaching its destination -- in fact, I
notice that the default route also goes through eth0, so no traffic will
be able to exit your guest (as you found with your SSH session).  See what
difference it makes when you configure the eth0 interface down (ifdown
eth0) -- this will remove the 10.14.3.0/24 route via eth0.  You still
won't have a default route, but traffic to the eth1-connected network will
keep going.
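
To make the routing argument concrete, here is a minimal Python sketch
that takes `route -n` style output and reports which routes still point at
a given interface, and whether any default route would be left without it.
The sample table is an assumption for illustration only (the 192.168.1.0
network on eth1 and the 10.14.3.1 gateway are made up), not your actual
configuration.

```python
# Sketch: which routes are pinned to a (possibly dead) interface?
# SAMPLE is hypothetical; paste real `route -n` output in its place.
SAMPLE = """\
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.14.3.0       0.0.0.0         255.255.255.0   U     0      0        0 eth0
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 eth1
0.0.0.0         10.14.3.1       0.0.0.0         UG    0      0        0 eth0
"""


def parse_routes(text):
    """Return (destination, gateway, genmask, iface) tuples."""
    routes = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 8 columns and start with a dotted-quad address.
        if len(fields) == 8 and fields[0][0].isdigit():
            routes.append((fields[0], fields[1], fields[2], fields[7]))
    return routes


def routes_via(routes, iface):
    """Routes whose outgoing device is iface."""
    return [r for r in routes if r[3] == iface]


def default_survives(routes, iface):
    """True if a default route exists on an interface other than iface."""
    return any(r[0] == "0.0.0.0" and r[2] == "0.0.0.0" and r[3] != iface
               for r in routes)


routes = parse_routes(SAMPLE)
for dest, gw, mask, dev in routes_via(routes, "eth0"):
    print(f"{dest}/{mask} via {gw} still uses {dev}")
print("default route left without eth0:", default_survives(routes, "eth0"))
```

With the sample above, both the 10.14.3.0 route and the default route
show up as bound to eth0, which is the situation I suspect you are in.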

Also, for inbound traffic, if the ARP cache in the client machines is
holding the MAC address of OSA1, the other OSA will not be used as an
inbound path until the cache in that client is cleared.  If this client
happens to be your router...
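
On a Linux client you can see which MAC address it currently holds for the
guest by reading /proc/net/arp.  A small sketch, assuming the usual
six-column layout of that file; the IP and MAC values in the sample are
invented for illustration:

```python
# Sketch: look up the MAC a Linux client has cached for a given IP.
# SAMPLE_ARP mimics /proc/net/arp; the addresses are hypothetical.
SAMPLE_ARP = """\
IP address       HW type     Flags       HW address            Mask     Device
10.14.3.20       0x1         0x2         02:00:00:00:00:01     *        eth0
10.14.3.1        0x1         0x2         02:00:00:00:00:aa     *        eth0
"""


def cached_mac(arp_text, ip):
    """Return the cached hardware address for ip, or None if absent."""
    for line in arp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 4 and fields[0] == ip:
            return fields[3].lower()
    return None


def is_stale(arp_text, ip, live_mac):
    """True if the cache holds a MAC for ip that differs from live_mac."""
    mac = cached_mac(arp_text, ip)
    return mac is not None and mac != live_mac.lower()


# On a real client: arp_text = open("/proc/net/arp").read()
print(cached_mac(SAMPLE_ARP, "10.14.3.20"))
```

If the entry turns out to be stale, deleting it on the client (arp -d
followed by the address) forces a fresh ARP request.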

Doing this kind of recovery automatically would be tricky.  You could
implement a scripted process based on Adam Thornton's VRT, or use some of
the health-checking functions from keepalived (for example), to
automatically configure an interface down and raise an alert when it no
longer passes traffic.  You could also use a dynamic routing protocol to
advertise your VIPA address to the network, but this may not be desirable
for you.
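
As a rough idea of what the scripted approach could look like, here is a
Python sketch that pings a target out of one interface and configures the
interface down after several consecutive failures.  This is not how VRT or
keepalived do it, only the general shape; the function names, threshold
and interval are my own placeholders, and it assumes the Linux iputils
ping (-c, -W, -I options) and root privileges for ifdown.

```python
# Sketch: take an interface down after several consecutive failed probes.
# Interface, target, threshold and interval are placeholders, not values
# from this thread.  Needs root for ifdown.
import subprocess
import time


def probe(iface, target):
    """Send one ping out of iface; True if a reply came back."""
    cmd = ["ping", "-c", "1", "-W", "1", "-I", iface, target]
    return subprocess.call(cmd, stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL) == 0


def should_fail_over(history, threshold):
    """True if the last `threshold` probe results were all failures."""
    recent = history[-threshold:]
    return len(recent) == threshold and not any(recent)


def watch(iface, target, threshold=3, interval=5, probe_fn=probe):
    """Poll until the link looks dead, then configure iface down."""
    history = []
    while True:
        history.append(probe_fn(iface, target))
        if should_fail_over(history, threshold):
            break
        time.sleep(interval)
    print(f"ALERT: {iface} no longer passes traffic to {target}")
    subprocess.call(["ifdown", iface])
```

Requiring several failures in a row keeps a single dropped ping from
taking the interface down.  Something like watch("eth0", gateway) would
then run from whatever supervises your guest's services.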

Hope this helps; get back to us with the results of your tests.

Cheers,
Vic



On Mon, 9 Jun 2003, Lucius, Leland wrote:

> I "think" I almost have it working, but I just can't get it all the way.  I
> have read and read and read as much info as I could find about it and, near
> as I can tell, I've done everything correctly.  The problem is that after
> takeover, guests sharing the same OSA can no longer talk to each other until
> the failed OSA is back up...
