> Date: Thu, 28 May 1998 16:42:36 -0400 (EDT)
> From: Dan Pritts <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED], [EMAIL PROTECTED]
> Subject: Re: Digital UNIX and AFS survey  (fwd)
> 
> i wonder if this filesystem corruption is related to the problems
> on the fv commerce servers.  FYI, in case you care. 

it looks like exactly the same problem we were seeing.

> ---------- Forwarded message ----------
> Date: Thu, 28 May 1998 13:39:48 -0400 (EDT)
> From: Kevin Hildebrand <[EMAIL PROTECTED]>
> To: Mark Giuffrida <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED]
> Subject: Re: Digital UNIX and AFS survey 
> 
> 
> > Interesting.  We have had Sun fileservers since the dawn of afs and we have 
> > never had any filesystem corruption.  Were you running the multiple processor 
> > machines before AFS was properly MP certified way back?
> 
> > Mark Giuffrida
> > University of Michigan, CAEN
> 
> Let me see if I can elaborate more on what Randall was saying.  Here
> at the University of Maryland our primary fileserver collection has
> always been Sun hardware.  We have 15-20 fileservers, all of which are
> Suns.  The problem we have been observing only affects Solaris 2.5,
> 2.5.1, and 2.6 fileservers, and only under conditions of heavy usage.
> All of the fileservers are single processor machines, SPARC 5s and 10s. 

We have seen this on Solaris 2.4 with a 4 CPU SPARC 1000.

> We have been in contact with both Transarc and Sun on this problem for
> 18 months now, and the consensus is that the problem lies in Sun's ufs
> drivers.  We are currently in the process of setting up test hardware
> so that we can work with the Sun engineering team to get the problem
> fixed.  

We were using ufs on one of Sun's SSA RAID boxes.

> The corruption appears in the form of hundreds of duplicate inodes and
> will eventually cause the server to kernel panic with messages like
> "freeing free frag" or "freeing free block".  The corruption damages
> volume headers as well as user data.  We haven't yet been able to
> duplicate the problem other than by observing our live filesystems.

exactly the same problem.  I have reproduced the error exactly once
after about a week of intensive testing (multiple processes creating
and removing large directory trees simultaneously on multiple file
systems).
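The kind of test loop described above can be sketched roughly like this. This is my own hypothetical reconstruction, not the actual harness used; the mount points, worker counts, and tree shape are all assumptions:

```python
import multiprocessing
import os
import shutil

def churn(root, iterations=100, depth=4, fanout=3):
    """Repeatedly create and remove a large directory tree under `root`."""
    for i in range(iterations):
        top = os.path.join(root, f"tree{i}")
        # expand a list of directories level by level to build the tree
        dirs = [top]
        for _level in range(depth):
            dirs = [os.path.join(d, f"d{k}") for d in dirs for k in range(fanout)]
        for d in dirs:
            os.makedirs(d, exist_ok=True)  # creates intermediate levels too
            with open(os.path.join(d, "f"), "w") as fh:
                fh.write("x" * 1024)       # one small file per leaf directory
        shutil.rmtree(top)                 # tear the whole tree back down

if __name__ == "__main__":
    # several workers hammering several file systems simultaneously
    roots = ["/mnt/fs0/stress", "/mnt/fs1/stress"]  # assumed mount points
    procs = [multiprocessing.Process(target=churn, args=(r,))
             for r in roots for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

The point of the multiple simultaneous processes is to maximize contention on the free-list updates; a single-threaded version of the same churn apparently never triggered the corruption.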

> We originally observed the problem on our RAID arrays, but we have
> also seen it on ordinary disks as well.

same here.

> Up until recently we were the only site that had reported this
> problem, but I believe there is now another site that is seeing
> similar corruption.

you can add us to your list.

> Our only solution to the problem so far has been to replace our Sun
> hardware with DEC Alphas, which run just fine.
> 
> Kevin Hildebrand
> University of Maryland, College Park
> 

First of all, I think it is clear to all that this is a very
obscure and infrequent bug, and that it probably _never_ happens
in most contexts.  I think it _can_ happen on very active file
systems which ALSO have very large numbers of inodes.  I _think_
it could very possibly result from some sort of page fault /
page boundary condition, and/or from a context being restored to
the wrong point after multiple interrupts all happening during
updates of the free list.

I wish I had saved my notes on this, because the pattern of duplicated
inodes was VERY non-random (in blocks whose sizes were powers of 2,
and beginning at very suggestive bits in the inode numbers, like
starting at an inode number that had a LONG sequence of zeros
on the right-hand side of its binary representation).
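As a rough illustration of the pattern described (my own sketch; the inode numbers here are made up, not from the actual corrupted filesystems):

```python
def trailing_zero_bits(n):
    """Count the zero bits on the right-hand side of n's binary form."""
    count = 0
    while n and n % 2 == 0:
        n //= 2
        count += 1
    return count

# A hypothetical starting inode number of the suggestive kind:
# a long run of zeros in the low bits, with the duplicate runs
# then covering power-of-two-sized blocks of inodes.
start = 0b1011000000000000   # 45056: twelve trailing zero bits
print(bin(start), trailing_zero_bits(start))
```

A starting point like that smells like a bit-shift, masking, or overflow error in whatever indexes the inode/free-list structures, which is consistent with the page-boundary speculation above.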

I would probably bet that this never happens if your number of inodes
fits in an int (which is possibly why it doesn't happen on the DEC
Alphas), though I have never really looked at the numbers to check.

If you think this sounds like wild guesses you would be
right.  I never expected that anybody else would run into
this bug.


Jerry 
