Udo Grabowski wrote:
> We are constantly experiencing data corruptions while writing (different)
> files from a lot of clients (~100) to one ZFS pool NFS v4 mounted from a
> single machine. Occasionally (1 out of a couple of 10000 writes) the files
> have a few ASCII 0 values in sequence in the middle of the file.
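(Aside, in case it is useful: a quick way to confirm those embedded NUL runs
in an affected file is something along these lines - just a rough sketch,
and the file name below is only a placeholder:)

#!/bin/sh
# Sketch: print any 16-byte od lines that are entirely NUL bytes in a
# suspect file.  od collapses repeated identical lines into a lone "*",
# which is another quick tell-tale of a zero-filled region.
FILE=/myworkdir/suspect_file
od -A d -t x1 "$FILE" | egrep '^[0-9]+( 00){16}$'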
This sounds like a bug.  In the past, when we've had zero-filled regions
that shouldn't be there, it has been because of bugs in keeping track of
the end-of-file offset, mostly on the client.  To track it down, do you
see this with NFSv3 also?

> Tested with SXDE 1/08 (79b) and recent Osol dev 06.09 (111a); host is an
> X4600 M2 with a 2x146GB SAS disk stripe (just testing), clients are
> Ultra 20 M1/M2, X4600, Blade X8420.

Just to be clear - have you tested those Solaris versions on both the
clients and the server?  I think the version of the client is most
important here.  snv_79b is pretty old, and I know some good fixes went
into both the client and the server since then.  Thanks for testing with
OpenSolaris, of course - that's very helpful.

> This has been tested with a script (writing ASCII zeros to files to
> generate load):
>
> #!/bin/sh
> WORK=/myworkdir        # nfs mounted dir somewhere
> echo Begin: `date`
> for i in `seq 8192`; do
>   dd if=/dev/zero of=$WORK/test_nfs.${HOSTNAME}_$$ obs=1b count=4
> done
> echo Done: `date`
>
> and sent via SGE gridengine (120-150 jobs, about 100 running at the same
> time), where the log output of this script also goes to that NFS mounted
> directory (runs about 2 hours on a 1 GbE switched fiber network).
> The problems appear in the log outputs (or data files in our other
> applications), and there are also not always 8192 dd instances recorded,
> but often 2 or 3 less, although the jobs have run without failure.

This is interesting, and also sounds like a bug, which may or may not be
related to the above.  I like it because it gives us a simpler reproducer
(modulo the many SGE instances :-).  Again, does this reproduce with NFSv3
also?
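In case it helps, forcing a v3 mount on a client for comparison is just a
matter of something like the following (a sketch - the server name and
paths are placeholders):

umount /myworkdir
mount -F nfs -o vers=3 server:/export/myworkdir /myworkdir

Alternatively, setting NFS_CLIENT_VERSMAX=3 in /etc/default/nfs caps every
NFS mount on that client at v3, which may be easier to roll out across the
~100 clients.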