Cindy L. Baca <[EMAIL PROTECTED]> writes:
> We have sent system core dumps to Transarc for analysis as well as
> before and after reboot logs.
...
> Between SUN and
> Transarc you would think someone would have a clue what the problem is
> and fix it. We are still hoping.
Here at the UM (ITD), we run a variety of Solaris machines in several
configurations, ranging from single-CPU "server" uses (a web server
with most content served from the local disk) to multi-user,
multi-processor "time-sharing" systems.  I'm not directly responsible
for any of these systems, but I do some work on Solaris, and I'm
peripherally involved in that I know most of the people sorting the
problems out here, and see many of them daily.  I'm also the kind of
person who likes to root around a bit and figure out what's really
happening, so here's what I've learned.
I ran across two kinds of AFS Solaris crashes while doing various
"random" things.  They were both "usage specific".  The first one was
a vulnerability in the cache manager triggered by certain kinds of
network traffic.  It's a vicious enough bug that you absolutely want
to get the latest patched AFS you can get hold of.  I never bothered
to dig into that one, other than to note that crash & adb are
definitely crippled, and *most* painful to use, when debugging kernel
extensions.
The second one was rather more interesting - I had managed to write a
program that "double invoked" the cache manager.  That caused it to
panic, because it tried to acquire the AFS global lock when it
already had it locked.  (The particular problem has since been
reported to Transarc, and I'm sure they're hard at work on a fix.)  I
spent a bit of time figuring out how to use "crash" and "adb" to
analyze the dumps - the traceback I constructed by hand matched a
later traceback obtained from "kadb" by one of the people involved
with sorting the problems out.  It was pretty tedious work - it took
me several hours, using the same target machine and software to sort
out addresses.  I can well imagine the kind of tedious nightmare
Transarc faces when trying to analyze the core dumps sent to them.
Based on that, though, I think I've learned enough to understand
when, where, and why you might see "lock-ups".  Basically, the
problem is that the AFS code was not originally designed with a
multiple-processor kernel architecture in mind.  Multiple processors
are a real problem, because they mean you can't just use "spl()" to
single-thread your kernel code; you have to do all sorts of ugliness
with mutexes and spin locks on all your data structures.  In fact, to
get optimal results, you really have to rethink your data structures;
and if you look at the kernel differences between various versions of
Solaris, you can start to get a feeling for what is involved here.
[ The changes just between 2.3 & 2.4 seem to make a *big* difference
in performance. ]
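To make that concrete, here's roughly what the difference looks like
in C.  This is only an illustrative sketch (the data structure and
the do_insert() routine are made up, not actual AFS or Solaris
source, and it assumes the usual kernel headers), but spl()/splx()
and mutex_enter()/mutex_exit() are the real interfaces:

    struct entry;                           /* some shared structure */
    extern void do_insert(struct entry *);  /* hypothetical update */

    /*
     * Uniprocessor style: raising the interrupt priority level is
     * enough, since nothing else can run while we hold it.
     */
    void
    up_insert(struct entry *e)
    {
            int s = splhigh();      /* lock out interrupts */
            do_insert(e);           /* safe: we're the only thread */
            splx(s);                /* restore old priority level */
    }

    /*
     * Multiprocessor style: another CPU keeps running no matter what
     * our interrupt priority is, so every shared structure needs an
     * explicit lock.
     */
    kmutex_t table_lock;            /* one lock per shared structure */

    void
    mp_insert(struct entry *e)
    {
            mutex_enter(&table_lock);   /* wait until nobody holds it */
            do_insert(e);
            mutex_exit(&table_lock);
    }

The painful part is the second style: *every* structure that used to
be protected implicitly by spl() has to grow its own lock, and every
code path that touches it has to be found and fixed.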
In Solaris terms, the bulk of the AFS cache manager is "MT-unsafe".
To ensure there's only one kernel thread in the AFS code at a time,
the AFS folks created a global lock, which guarantees that at most
one kernel thread is running in that code at once.  There are 134
calls to that lock in the roughly 52,000 lines of code in the AFS
cache manager, so it's obviously a monumental job to go through and
make sure that no possible code path will try to lock it when it's
already locked (and panic).
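The crash I described above is exactly what happens when one of those
code paths slips through.  Here's a sketch of the failure mode (the
names are made up, not Transarc's actual code; curthread is the real
Solaris kernel macro for the running thread):

    kmutex_t afs_global_lock;       /* the one big lock (name made up) */
    kthread_t *afs_global_owner;    /* thread currently inside AFS code */

    void
    AFS_ENTER(void)
    {
            if (afs_global_owner == curthread)      /* re-entry? */
                    panic("afs: global lock already held");
            mutex_enter(&afs_global_lock);
            afs_global_owner = curthread;
    }

    void
    AFS_EXIT(void)
    {
            afs_global_owner = NULL;
            mutex_exit(&afs_global_lock);
    }

My "double invoked" program drove the cache manager back into itself,
so the equivalent of AFS_ENTER() ran twice on the same thread, and
the panic fired.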
A second problem is simply performance.  Since the AFS cache manager
is more or less optimized for a uniprocessor system, many parts of
the code tend to "use" a cache manager data structure for a
relatively "long" period of time.  Some of those "long" periods turn
out to be unanticipated "features" of the AFS cache manager - one of
the hashing functions turned out to be bad, and was putting a lot of
things on one hash chain, which meant everything piled up waiting for
things to happen on that one chain.  That's a problem even on a
uniprocessor machine, but it's a lot worse when you have the
multi-processor code in the picture too.

I'm not positive about this (the Solaris man pages are very silent on
it), but I *believe* the Solaris mutexes are "spin locks" - meaning
that when one kernel thread tries to acquire a lock that another
processor is already holding, it sits there burning up CPU, waiting
for the other processor to relinquish the lock.  That's fine for a
lock that's released within microseconds, but bad news if the other
thread is going to be busy for 100 milliseconds doing network I/O.
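If I'm right about that, the waiting side of a spin lock looks
something like this sketch (not the real Solaris implementation;
assume test_and_set() is an atomic test-and-set primitive, e.g.
ldstub on SPARC):

    typedef struct {
            volatile unsigned char locked;  /* 0 = free, 1 = held */
    } spinlock_t;

    void
    spin_lock(spinlock_t *lk)
    {
            /* Busy-wait: the CPU does no useful work in this loop. */
            while (test_and_set(&lk->locked) != 0)
                    continue;
    }

    void
    spin_unlock(spinlock_t *lk)
    {
            lk->locked = 0;
    }

So if the thread holding a lock is off doing 100 milliseconds of
network I/O, every other processor that wants that lock sits in that
while loop, at full speed, for the whole 100 milliseconds.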
I'm sure Transarc is hard at work doing something about this.  Even
if Solaris is not their main product interest, they surely want AFS
to work flawlessly on AIX 4.1; and since AIX 4.1 is *also* a
multi-processor system, and since it's sold by their parent company,
they've clearly got a strong incentive to make AFS capable of doing
the Right Thing on a multi-processor machine.
The versions of Solaris & AFS we had a year ago worked just as badly
as <[EMAIL PROTECTED]> describes.  The very latest versions we have
work a *lot* better.  There are still one or two minor problems, and
it needs more testing, but nobody here doubts we will have a good
version (better than anything we have at present, for *any* platform)
by fall.
-Marcus Watts
UM ITD RS Umich Systems Group