We run a NetApp F840 filer with a 350GB volume which is mounted by our FreeBSD
4.x clients using NFSv3.

Today it seems that two fopen() calls for the same file on that mounted volume
yielded file streams associated with two different files.

Two different processes on the same host, run at approximately the same time
(although not concurrently) seem to have got FILE *streams mixed up. Each
process chdir()s to a different directory on the mounted volume, readdir()s and
then operates on the files using the relative path, rather than an absolute
path to the file.

The simplified chain of events was:

 Process 5453 (starts at 19:09:46 ends at 19:09:48):

   - File A is only stat()'d by this process

 Process 5592 (starts at 19:10:02 ends at 19:11:41):
   - File B is stat()'d
   - File B is fopen()'d
   - Contents are read with fgets()

       (the contents read at this stage are actually the contents of File A,
        not File B)

   - File B is fclose()'d
   - [other operations are performed]
   - File B is fopen()'d again and the contents parsed for a unique token

       (this time, the contents relate to File B and the correct unique token
        is found. This token was *not* read when the file was previously
        opened)

   - File B is fclose()'d and unlink()'d

File A and File B are in different directories on the volume.

The unique token is used in logs to provide an audit trail. It is logged when
the file is written and when it is unlinked. The net result of the events
described was that process 5592 read the contents of the wrong file, before
unlinking the correct one; effectively the contents of File B were lost.

Given that the two processes were executed within twenty seconds of one
another, I wondered if some NFS caching either on the server or client side was
causing this behaviour. The client in question was running 4.6-STABLE as at Jul
17 2002.

Does it seem plausible that sys/nfs may have cached File A's information and
associated File B's stream with it in error? I've had a cursory look at CVS
commits relating to NFS since July 2002 on the RELENG_4 branch, but I admit to
not being an expert in this area and didn't spot anything.

I have a case open with NetApp in case this could be attributed to an error on
the filer, such as an inconsistent filesystem, although I've not yet heard
anything back.

I don't think this behaviour will be easily reproducible: the cluster generates
around 3000 NFS operations per second on average, and this sort of behaviour
has only been brought to my attention twice in the last month.

I have implemented a little sanity check after fopen() which confirms that the
inode the path resolves to is the same as the inode of the file associated with
the stream before proceeding, but this may not help if file credentials are
being incorrectly cached on the NFS client. Still, it can't do much harm to do
the check.

Pseudocode without error checking:

  FILE *f;
  struct stat ssb, fsb;

  f = fopen(filename, "r");
  stat(filename, &ssb);          /* inode the path currently resolves to */
  fstat(fileno(f), &fsb);        /* inode behind the open stream */
  if (ssb.st_ino != fsb.st_ino) {
    /* report inconsistency error */
  }
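
For anyone wanting to try the same check, here is a rough sketch of it with
the error handling filled in. The function name fopen_checked() is made up for
illustration and is not what our code actually uses; comparing st_dev as well
as st_ino is an extra precaution I have assumed here, not part of the check
described above.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: open `path` for reading and confirm that the open
 * stream refers to the same object that stat() reports for the path.
 * Returns the stream on success, or NULL on any error or mismatch.
 */
FILE *
fopen_checked(const char *path)
{
	FILE *f;
	struct stat psb, fsb;

	if ((f = fopen(path, "r")) == NULL)
		return (NULL);

	/* psb describes the path, fsb describes the descriptor we hold. */
	if (stat(path, &psb) == -1 || fstat(fileno(f), &fsb) == -1 ||
	    psb.st_dev != fsb.st_dev || psb.st_ino != fsb.st_ino) {
		/* Either call failed or the two disagree: refuse the stream. */
		fclose(f);
		return (NULL);
	}
	return (f);
}
```

Note that both calls go through the same NFS client, so if the client's cached
attributes are themselves wrong the two results may still agree; the check only
catches the case where the path and the descriptor visibly diverge.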

If there's further information that I can provide to help make sense of this
turn of events I would be glad to provide it.


Oliver Cook    Systems Administrator, Claranet UK
[EMAIL PROTECTED]                  020 7903 3065