Hi,
  Alexey and Herbert - thanks for the replies.

Alexey wrote:

> > similar things on a router - it has died 3 or 4 times (over a period
> > of a few months) with such an error with very little traffic passing
> > through it and a stream of the 'dst cache overflow' errors on the screen.
> 
> Actually, it is quite unusual. The problems with garbage collection
> are all transient, essentially you do not see anything bad but
> some annoying messages.
> 
> If the machine dies... Well, it cannot be the reason of death.
> I would even suspect that "dst cache overflow" was not reason of death,
> but rather a consequence.
> 
> If the death means just a loss of network connectivity, it could
> mean that you experience a _true_ (not related to gc problems)
> dst cache overflow i.e. it happens because some part of kernel leaks
> dst cache entries. It is the first thing to check, see below.

I think the machine was still alive, but all it does is
route, so there wasn't much to tell. It had certainly
stopped routing (most?) traffic about 10 hours
before I got to it and was still very ill, so it isn't
a transient thing.
The router handles outgoing traffic and routing between two
small subnets (probably 200 ish IPs or so on each); it doesn't
open any connections itself and it isn't directly on the outside
world.

One thought; the day before it fell into this state there had
been a minor screw up on one of the networks where someone
mispatched two subnets together (one of which was one
of the ones connected to this box); now that may have
caused a lot of arping and general unhappiness - but it all
seemed to resolve itself; I don't think similar problems
had happened before the previous failures.

> > a patch by Denis Lunev that is currently in one of the 2.6.13-pre's
> > ('Fix too aggressive backoff in dst garbage collection' git commit number
> > f0098f7863f814a5adc0b9cb271605d063cad7fa )
> 
> It will not help, it is a transient problem.

OK.

> Plus run "ip route ls cache" periodically.

OK, I'll add that to some monitoring.

> The first thing, which you should watch is difference between
> number of entries shown by "ip route ls cache" (alive entries)
> and rtstat (it shows all, including lost ones).
> 
> If the difference gradually grows with time, we definitely see a leakage.

OK.
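
For the monitoring, the comparison described above could be sketched
roughly as follows. This is only a minimal sketch in Python: the function
names are made up for illustration, and the parsing rules are assumptions
(that each entry in "ip route ls cache" output starts on an unindented
line, and that the total can be read from the hexadecimal "entries" column
of the file rtstat reads, assumed here to be /proc/net/stat/rt_cache).

```python
# Sketch only: the parsing rules are assumptions about the tool output,
# not taken from the kernel or iproute2 sources.

def count_cache_entries(ip_route_output):
    """Count alive entries in `ip route ls cache` output.

    Assumes each entry starts on an unindented line and that any
    continuation lines belonging to it are indented.
    """
    return sum(
        1
        for line in ip_route_output.splitlines()
        if line.strip() and not line[0].isspace()
    )


def total_entries(rt_cache_stat_text):
    """Read the 'entries' column from rt_cache statistics text.

    Assumes a header line naming the columns, followed by one line of
    hexadecimal values per CPU; only the first CPU line is used.
    """
    lines = rt_cache_stat_text.splitlines()
    header = lines[0].split()
    values = lines[1].split()
    return int(values[header.index("entries")], 16)


def difference_grows(samples):
    """Return True if (total - alive) is larger at the end than at the start.

    samples is a list of (alive, total) pairs, oldest first. This is a
    deliberately crude trend check; a real script would want smoothing.
    """
    gaps = [total - alive for alive, total in samples]
    return len(gaps) >= 2 and gaps[-1] > gaps[0]
```

A cron job could feed count_cache_entries() with the output of
"ip route ls cache", feed total_entries() with the statistics file, append
the pair to a log, and raise an alarm when difference_grows() stays true
over a long window.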

> <explanation of route.c and dst.c>

Thanks for that explanation - it helps somewhat - one thing I was
confused by was why the timer mechanism for the garbage collection
was so elaborate; why does it do all that back off stuff and
adjusting itself? Why not just run at some fixed rate?

* Herbert Xu ([EMAIL PROTECTED]) wrote:
> Alexey Kuznetsov <[EMAIL PROTECTED]> wrote:
> > 
> > Really bad overflow happens when lots of entries remain in use, because
> > someone forgot to release the references to dst cache entries.
> > It is the first thing to check.
> 
> Yes.  I once had a situation where a buggy user-land program held
> many sockets open each of which had ancient packets stuck in their
> receive queues.  The result was a lot of dst entries hanging around.

Nod - I don't think it is that in this case because the machine
doesn't open any connections itself.

> In such cases checking /proc/slabinfo could be useful.

But I will try to remember to check that next time it goes, or add
it to the monitoring scripts.
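
A minimal sketch of what such a slabinfo check might look like in Python.
The function name is made up for illustration, and two things are assumed
rather than checked: that the dst cache slab is called "ip_dst_cache", and
that each data line in /proc/slabinfo starts with the cache name followed
by the active object count.

```python
def slab_active_objects(slabinfo_text, cache_name="ip_dst_cache"):
    """Return the active object count for one slab cache, or None.

    Assumes each data line starts with the cache name followed by the
    number of active objects; header lines simply fail to match.
    """
    for line in slabinfo_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == cache_name:
            return int(fields[1])
    return None
```

Called as slab_active_objects(open("/proc/slabinfo").read()) from a
periodic job, the returned count could be logged next to the route cache
numbers so that a steady climb shows up before the box stops routing.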

Thank you for your suggestions; if I'm unlucky you'll
see a question from me (with some more debug) in a month
or two if it does it again!

Dave
--
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    | Running GNU/Linux on Alpha,68K| Happy  \ 
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/
