> Date: Thu, 28 May 1998 16:42:36 -0400 (EDT)
> From: Dan Pritts <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED], [EMAIL PROTECTED]
> Subject: Re: Digital UNIX and AFS survey (fwd)
>
> i wonder if this filesystem corruption is related to the problems
> on the fv commerce servers.

FYI, in case you care.  It looks like exactly the same problem we were
seeing.

> ---------- Forwarded message ----------
> Date: Thu, 28 May 1998 13:39:48 -0400 (EDT)
> From: Kevin Hildebrand <[EMAIL PROTECTED]>
> To: Mark Giuffrida <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED]
> Subject: Re: Digital UNIX and AFS survey
>
> > Interesting.  We have had Sun fileservers since the dawn of afs and we
> > have never had any filesystem corruption.  Were you running the multiple
> > processor machines before AFS was properly MP certified way back?
> >
> > Mark Giuffrida
> > University of Michigan, CAEN
>
> Let me see if I can elaborate more on what Randall was saying.  Here
> at the University of Maryland our primary fileserver collection has
> always been Sun hardware.  We have 15-20 fileservers, all of which are
> Suns.  The problem we have been observing only affects Solaris 2.5,
> 2.5.1, and 2.6 fileservers, and only under conditions of heavy usage.
> All of the fileservers are single-processor machines, SPARC 5s and 10s.

We have seen this on Solaris 2.4 with a 4-CPU SPARC 1000.

> We have been in contact with both Transarc and Sun on this problem for
> 18 months now, and the consensus is that the problem lies in Sun's ufs
> drivers.  We are currently in the process of setting up test hardware
> so that we can work with the Sun engineering team to get the problem
> fixed.

We were using ufs on one of Sun's SSA RAID boxes.

> The corruption appears in the form of hundreds of duplicate inodes and
> will eventually cause the server to kernel panic with messages like
> "freeing free frag" or "freeing free block".  The corruption damages
> volume headers as well as user data.  We haven't yet been able to
> duplicate the problem other than by observing our live filesystems.

Exactly the same problem.  I have reproduced the error exactly once,
after about a week of intensive testing (multiple processes creating and
removing large directory trees simultaneously on multiple file systems).

> We originally observed the problem on our RAID arrays, but we have
> also seen it on ordinary disks as well.

Same here.

> Up until recently we were the only site that had reported this
> problem, but I believe there is now another site that is seeing
> similar corruption.

You can add us to your list.

> Our only solution to the problem so far has been to replace our Sun
> hardware with DEC Alphas, which run just fine.
>
> Kevin Hildebrand
> University of Maryland, College Park

First of all, I think it is clear to all that this is a very obscure and
infrequent bug, and it probably _never_ happens in most contexts.  I
think it _can_ happen in the context of very active file systems which
ALSO have very large numbers of inodes.  I _think_ it could very
possibly result from some sort of page fault / page boundary condition,
and/or a context being restored to the wrong point after multiple
interrupts all happening during updates of the free list.

I wish I had saved my notes on this, because the pattern of duplicated
inodes was VERY non-random (in blocks that were powers of 2, and
beginning at very suggestive bits in the inode numbers, like starting at
an inode number that had a LONG sequence of zeros on the right-hand side
of its binary representation).  I would probably bet that this never
happens if your number of inodes fits in an int (possibly why it doesn't
happen on the DEC Alphas), though I have never really looked at the
numbers to check.

If you think this sounds like wild guesses, you would be right.  I never
expected that anybody else would run into this bug.

Jerry
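[Editor's note: for anyone who wants to try hammering a file system the
same way, the reproduction Jerry describes (multiple processes creating
and removing large directory trees simultaneously on multiple file
systems) can be sketched roughly as below.  The mount points, process
counts, and tree sizes are illustrative guesses, not values from the
thread.]

```python
#!/usr/bin/env python
# Rough sketch of the stress test described above: several processes
# simultaneously creating and removing large directory trees on a set
# of file systems.  All parameters here are made-up illustrations.
import os
import shutil
import multiprocessing

def churn(root, iterations=50, width=8, files_per_dir=20):
    """Repeatedly build and then tear down a directory tree under root."""
    for i in range(iterations):
        top = os.path.join(root, "stress.%d.%d" % (os.getpid(), i))
        for d in range(width):
            sub = os.path.join(top, "d%d" % d)
            os.makedirs(sub)
            for f in range(files_per_dir):
                # create an empty file
                open(os.path.join(sub, "f%d" % f), "w").close()
        shutil.rmtree(top)

def stress(mount_points, procs_per_fs=4, iterations=50):
    """Run several churn() workers concurrently on each file system."""
    workers = []
    for fs in mount_points:
        for _ in range(procs_per_fs):
            p = multiprocessing.Process(target=churn, args=(fs, iterations))
            p.start()
            workers.append(p)
    for p in workers:
        p.join()

if __name__ == "__main__":
    # e.g. stress(["/fs0", "/fs1", "/fs2"], procs_per_fs=4)
    pass
```

Left running for days across several mount points, this is the kind of
workload Jerry says eventually tripped the bug once.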

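[Editor's note: as a footnote to the pattern Jerry describes, here is a
small sketch of how one might check whether runs of duplicated inode
numbers really do come in power-of-2 blocks starting at inode numbers
with long runs of trailing zero bits.  The function names and the inode
numbers in the usage comment are hypothetical, not taken from his notes.]

```python
# Sketch: given a list of duplicated inode numbers (e.g. pulled from
# fsck output), group them into consecutive runs and report, for each
# run, its starting inode, its length, how many trailing zero bits the
# start has, and whether the length is a power of 2.
def trailing_zero_bits(n):
    """Count the zero bits on the right-hand side of n's binary form."""
    if n == 0:
        return 0
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0

def summarize_runs(inodes):
    """Return (start, length, trailing_zeros(start), length_is_pow2)
    for each maximal run of consecutive inode numbers."""
    inodes = sorted(set(inodes))
    runs = []
    start = prev = inodes[0]
    for n in inodes[1:]:
        if n != prev + 1:
            runs.append((start, prev - start + 1))
            start = n
        prev = n
    runs.append((start, prev - start + 1))
    return [(s, length, trailing_zero_bits(s), is_power_of_two(length))
            for s, length in runs]

# Hypothetical example: a block of 4 duplicates at inode 128 (seven
# trailing zero bits) and a block of 8 at inode 4096 (twelve trailing
# zero bits) would both fit the pattern Jerry describes.
# summarize_runs(list(range(4096, 4104)) + [128, 129, 130, 131])
```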