> similar things on a router - it has died 3 or 4 times (over a period
> of a few months) with such an error with very little traffic passing
> through it and a stream of the 'dst cache overflow' errors on the screen.
Actually, this is quite unusual. Problems with garbage collection
are all transient: essentially, you see nothing worse than
some annoying messages.
If the machine dies... Well, garbage collection cannot be the cause of death.
I would even suspect that "dst cache overflow" was not the cause of death,
but rather a consequence.
If "death" means just a loss of network connectivity, it could
mean that you are experiencing a _true_ (not related to gc problems)
dst cache overflow, i.e. one that happens because some part of the kernel
leaks dst cache entries. That is the first thing to check, see below.
> a patch by Denis Lunev that is currently in one of the 2.6.13-pre's
> ('Fix too aggressive backoff in dst garbage collection' git commit number
> f0098f7863f814a5adc0b9cb271605d063cad7fa )
It will not help; the problem it fixes is a transient one.
> I'll give those a go, but I wondered if there was a general set of
> diag/monitoring that would be useful so that if it does it again I can
> present some useful debug? At the moment I have a rtstat 60 running into
> a log file.
Plus, run "ip route ls cache" periodically.
The first thing you should watch is the difference between
the number of entries shown by "ip route ls cache" (live entries)
and by rtstat (which shows all entries, including lost ones).
If the difference grows gradually over time, we are definitely seeing a leak.
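
A rough sketch of how that comparison could be scripted, in Python. It is
not a tested tool. Assumptions on my side: rtstat takes its numbers from
/proc/net/stat/rt_cache, whose first hex column is the total entry count
repeated on every per-CPU row; each entry in the "ip route ls cache" output
starts on an unindented line; and the "later half above earlier half" test
is just one arbitrary way to say "gradually grows". All function names are
made up for this example.

```python
import subprocess


def rt_cache_entries(stat_text):
    """Total dst entries from /proc/net/stat/rt_cache-style text.

    Assumed layout: one header line, then per-CPU rows of hex fields,
    with the global entry count in the first column of every row.
    """
    rows = [ln for ln in stat_text.splitlines() if ln.strip()]
    return int(rows[1].split()[0], 16)


def live_cache_entries(ip_output):
    """Count entries in `ip route ls cache` output.

    Assumed layout: each entry starts on an unindented line and its
    continuation lines (cache flags, mtu, ...) are indented.
    """
    return sum(1 for ln in ip_output.splitlines()
               if ln.strip() and not ln[0].isspace())


def leak_suspected(gaps, min_samples=6):
    """Heuristic: every gap in the later half exceeds every gap in the earlier half."""
    if len(gaps) < min_samples:
        return False
    half = len(gaps) // 2
    return min(gaps[half:]) > max(gaps[:half])


def sample_gap():
    """Take one sample on a live system: total entries minus visible entries."""
    with open("/proc/net/stat/rt_cache") as f:
        total = rt_cache_entries(f.read())
    out = subprocess.run(["ip", "route", "ls", "cache"],
                         capture_output=True, text=True, check=True).stdout
    return total - live_cache_entries(out)
```

Calling sample_gap() once a minute, appending the result to a list and
passing that list to leak_suspected() gives a crude alarm. The raw numbers
in the log are still more useful than the verdict.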
> P.S. Is there a good description of how these caches work? I was looking
> at dst.c and route.c and they both seem to have garbage collection
> mechanisms and the 'dst cache overflow' comes from the ipv4/route.c -
> at this point I'm rather confused between the relationship of the code
> in dst.c and that in route.c and all the various garbage collection that
> goes on.
Throughout its lifetime, a dst cache entry is always on some list.
It starts in the routing cache (ipv4/route.c); such entries are visible with
"ip route ls cache". The garbage collection routine in route.c searches
for stale _unused_ entries, removes them from the hash table and releases them.
Normally and logically, that's all. :-)
But sometimes we have to delete an entry which is still in use. In this
case, it is removed from the hash table and moved to the garbage list in dst.c.
The entry waits on that list until all the references to it are released.
In 2.6 a new player appeared: after removal from the hash table, an entry is
not put on the garbage list immediately, but waits for some time on an RCU list.
Before 2.6.9 this was quite a problem, because this list used to stall; in 2.6.9
it was repaired. But it is possible that the fix itself is broken.
Dst cache overflows can happen because of a bug in any of the three stages
of "garbage collection", but those problems are always transient.
A really bad overflow happens when lots of entries remain in use, because
someone forgot to release their references to dst cache entries.
That is the first thing to check.
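
To make the stages concrete, here is a toy model in Python. It is only an
illustration of the description above, not a rendering of the kernel code:
the class and method names are invented, and the RCU grace period is reduced
to an explicit method call.

```python
class DstEntry:
    def __init__(self, name):
        self.name = name
        self.refcnt = 0  # number of users still holding this entry


class ToyDstCache:
    """Toy model: hash table -> RCU list -> garbage list -> freed."""

    def __init__(self):
        self.hash_table = []    # what "ip route ls cache" would show
        self.rcu_list = []      # unlinked, waiting for a grace period
        self.garbage_list = []  # still referenced, waiting for release

    def add(self, name):
        entry = DstEntry(name)
        self.hash_table.append(entry)
        return entry

    def hold(self, entry):
        entry.refcnt += 1

    def release(self, entry):
        entry.refcnt -= 1

    def unlink(self, entry):
        # Stage 1: remove from the hash table; the entry becomes invisible.
        self.hash_table.remove(entry)
        self.rcu_list.append(entry)

    def rcu_grace_period(self):
        # Stage 2: after the grace period, unused entries are freed at once,
        # entries still in use go to the garbage list.
        self.garbage_list += [e for e in self.rcu_list if e.refcnt > 0]
        self.rcu_list = []

    def collect_garbage(self):
        # Stage 3: free garbage entries whose references are all released.
        self.garbage_list = [e for e in self.garbage_list if e.refcnt > 0]

    def visible(self):
        return len(self.hash_table)

    def total(self):
        # Everything not yet freed, visible or not.
        return len(self.hash_table) + len(self.rcu_list) + len(self.garbage_list)
```

In this model an entry whose release() is never called stays on the garbage
list forever: total() stays above visible() no matter how often the
collection steps run. That is the "really bad" case; a stall in any one
stage only delays the freeing.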