On Thu, Dec 22, 2011 at 6:59 AM, Kasper Brink <[email protected]> wrote:
> Hello all,
>
> I'm testing a new fileserver running OI 151a, and I've run into a problem
> with an NFS4-mounted filesystem, on a Linux client, that stops responding.
> This happens after running a filebench workload on the client for several
> minutes. Metadata operations (ls, stat, rm, mkdir) still work, but anything
> that involves file contents (e.g. cat) blocks indefinitely. Getting out of
> this state requires restarting the nfs service on the server (and waiting 2
> minutes for recovery). The good news is that this problem is reproducible;
> details below.
>
> When the problem occurs, snoop shows the following being repeated:
>
> 0.00009 basil.cs.ru.nl -> thyme.cs.ru.nl NFS C 4 () PUTFH FH=8775 SAVEFH OPEN
>     00000614 OT=NC SQ=0 CT=N AC=RW DN=N OO=3271 GETFH GETATTR 10011a 30a23a...
> 0.00022 thyme.cs.ru.nl -> basil.cs.ru.nl NFS R 4 () NFS4ERR_STALE_CLIENTID
>     PUTFH NFS4_OK SAVEFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
> 0.00012 basil.cs.ru.nl -> thyme.cs.ru.nl NFS C 4 () RENEW CL=654ee52ddc
> 0.00002 thyme.cs.ru.nl -> basil.cs.ru.nl NFS R 4 () NFS4_OK RENEW NFS4_OK
> 0.00008 basil.cs.ru.nl -> thyme.cs.ru.nl NFS C 4 () PUTFH FH=8775 SAVEFH OPEN
>     00000614 OT=NC SQ=0 CT=N AC=RW DN=N OO=3208 GETFH GETATTR 10011a 30a23a...
> 0.00019 thyme.cs.ru.nl -> basil.cs.ru.nl NFS R 4 () NFS4ERR_STALE_CLIENTID
>     PUTFH NFS4_OK SAVEFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
> 0.00012 basil.cs.ru.nl -> thyme.cs.ru.nl NFS C 4 () RENEW CL=654ee52ddc
> 0.00002 thyme.cs.ru.nl -> basil.cs.ru.nl NFS R 4 () NFS4_OK RENEW NFS4_OK
> ... and so on.
>
> So the client gets an NFS4ERR_STALE_CLIENTID on OPEN, successfully renews
> its clientid, and then immediately gets the same error again. This
> seems to be the same problem as described in
> http://thread.gmane.org/gmane.linux.nfs/44449 ; the conclusion of that
> thread was that this is not a bug in the Linux client.
>
> The filebench workload that reproduces this behaviour is just the
> creation/reuse of a fileset with many small files (no I/O flowops are
> needed). The problem always seems to occur after about 1 million NFS4 OPEN
> ops. The same workload runs without issues when the client is OI 151a, or
> the server is Linux, or over NFS3. I don't think it is hardware related,
> because I get the same behaviour with Xen PV domains.
>
> Is this a known problem, or should I report it as a bug? Is there anything
> else I can do to help debug this?
>
> Regards,
>
> Kasper Brink
>
>
>
> Steps to reproduce
> ==================
>
> # On SERVER:
>
> # (Ramdisk-based pool is fastest, but disk-based works too)
> ramdiskadm -a tempdisk 256m
> zpool create temppool /dev/ramdisk/tempdisk
> zfs set sharenfs=rw=$CLIENT,root=$CLIENT temppool
>
>
> # On CLIENT:
>
> # I used Debian Squeeze (6.0.3), but I expect other distros will work as well.
> # uname -a : Linux basil 2.6.32-5-xen-amd64 #1 SMP Mon Oct 3 07:53:54 UTC 2011 x86_64 GNU/Linux
> # dpkg -l nfs-common : nfs-common 1:1.2.2-4 NFS support files common to client and server
>
> # Get filebench, either from distro, or download:
> # http://sourceforge.net/projects/filebench/files/filebench/filebench-1.4.9.1
> # untar; ./configure && make && make install
>
> mkdir /mnt/temppool
> mount -t nfs4 -o rw,sync,hard $SERVER:/temppool /mnt/temppool
> # Check that filesystem is writeable
> touch /mnt/temppool/foo
>
> # Save nfs4test.f (below) to a file
>
> for i in $(seq 15); do echo ===== $i $(date); filebench -f nfs4test.f; done
> # The NFS4 mount should become unresponsive around iteration 8 or 9...
>
>
> ############################################################
> # nfs4test.f (Filebench workload)
> ############################################################
>
> set $dir=/mnt/temppool
> set $nfiles=128k
> set $filesize=1k
>
> define fileset name=nfs4test,path=$dir,size=$filesize,entries=$nfiles,
> filesizegamma=0,dirwidth=1000,prealloc,reuse
>
> define process name=dummy,instances=1
> {
>   thread name=dummy,memsize=1m,instances=1
>   {
>     flowop finishoncount name=finishoncount,value=0
>   }
> }
>
> set mode quit firstdone
> run
>
> ############################################################
>
This is not a known issue for us, but it might be related to resource
limitations in the NFSv4 server. Feel free to file a new bug.

Can you run

    echo -e '::rfs4_db\n::rfs_client' | mdb -k

on the server before and after the test?

-Albert
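
A minimal sketch of how the before/after capture requested above could be scripted on the server, assuming a POSIX shell. The names snapshot, compare_snapshots, DEBUG_CMD and OUT_DIR are hypothetical and exist only for illustration; the mdb pipeline is the one quoted in this message, with printf swapped in for echo -e because echo -e is not portable across shells.

```shell
#!/bin/sh
# Hypothetical helper, not part of any illumos tool: save the output of the
# requested mdb dcmds to a labelled file so two runs can be compared.

# Assumption: the default command is the pipeline from the message.
# The \n sequences stay literal here and are expanded by printf at run time.
DEBUG_CMD=${DEBUG_CMD:-"printf '::rfs4_db\n::rfs_client\n' | mdb -k"}
# Assumption: any writable directory will do for the snapshots.
OUT_DIR=${OUT_DIR:-/tmp/nfs4-debug}

snapshot() {
    # $1 is a label such as "before" or "after".
    mkdir -p "$OUT_DIR" || return 1
    # Capture stderr too, so dcmd errors end up in the saved file.
    sh -c "$DEBUG_CMD" > "$OUT_DIR/$1.txt" 2>&1
}

compare_snapshots() {
    # diff exits non-zero when the two files differ, which is the
    # interesting case here.
    diff -u "$OUT_DIR/before.txt" "$OUT_DIR/after.txt"
}
```

Typical use would be to call "snapshot before" on the server, run the filebench loop on the client until the mount hangs, then call "snapshot after" followed by "compare_snapshots". Running mdb -k will most likely need root privileges.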
