On Thu, Dec 22, 2011 at 9:58 AM, Albert Lee <[email protected]> wrote:
> On Thu, Dec 22, 2011 at 6:59 AM, Kasper Brink <[email protected]> wrote:
>> Hello all,
>>
>> I'm testing a new fileserver running OI 151a, and I've run into a problem
>> with an NFS4-mounted filesystem, on a Linux client, that stops responding.
>> This happens after running a filebench workload on the client for several
>> minutes. Metadata operations (ls, stat, rm, mkdir) still work, but anything
>> that involves file contents (e.g. cat) blocks indefinitely. Getting out of
>> this state requires restarting the nfs service on the server (and waiting 2
>> minutes for recovery). The good news is that this problem is reproducible;
>> details below.
>>
>> When the problem occurs, snoop shows the following being repeated:
>>
>> 0.00009 basil.cs.ru.nl -> thyme.cs.ru.nl NFS C 4 () PUTFH FH=8775 SAVEFH OPEN 00000614 OT=NC SQ=0 CT=N AC=RW DN=N OO=3271 GETFH GETATTR 10011a 30a23a...
>> 0.00022 thyme.cs.ru.nl -> basil.cs.ru.nl NFS R 4 () NFS4ERR_STALE_CLIENTID PUTFH NFS4_OK SAVEFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
>> 0.00012 basil.cs.ru.nl -> thyme.cs.ru.nl NFS C 4 () RENEW CL=654ee52ddc
>> 0.00002 thyme.cs.ru.nl -> basil.cs.ru.nl NFS R 4 () NFS4_OK RENEW NFS4_OK
>> 0.00008 basil.cs.ru.nl -> thyme.cs.ru.nl NFS C 4 () PUTFH FH=8775 SAVEFH OPEN 00000614 OT=NC SQ=0 CT=N AC=RW DN=N OO=3208 GETFH GETATTR 10011a 30a23a...
>> 0.00019 thyme.cs.ru.nl -> basil.cs.ru.nl NFS R 4 () NFS4ERR_STALE_CLIENTID PUTFH NFS4_OK SAVEFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
>> 0.00012 basil.cs.ru.nl -> thyme.cs.ru.nl NFS C 4 () RENEW CL=654ee52ddc
>> 0.00002 thyme.cs.ru.nl -> basil.cs.ru.nl NFS R 4 () NFS4_OK RENEW NFS4_OK
>> ... and so on.
>>
>> So the client is getting an NFS4ERR_STALE_CLIENTID on OPEN, it successfully
>> renews its clientid, and then immediately gets the same error again. This
>> seems to be the same problem as described in
>> http://thread.gmane.org/gmane.linux.nfs/44449 ; the conclusion of that
>> thread was that this is not a bug in the Linux client.
>>
>> The filebench workload that reproduces this behaviour is just the
>> creation/reuse of a fileset with many small files (no I/O flowops are
>> needed). The problem always seems to occur after about 1 million NFS4 OPEN
>> ops. The same workload runs without issues when the client is OI 151a, or
>> the server is Linux, or over NFS3. I don't think it is hardware related,
>> because I get the same behaviour with Xen PV domains.
>>
>> Is this a known problem, or should I report it as a bug? Is there anything
>> else I can do to help debug this?
>>
>> Regards,
>>
>> Kasper Brink
>>
>>
>>
>> Steps to reproduce
>> ==================
>>
>> # On SERVER:
>>
>> # (Ramdisk-based pool is fastest, but disk-based works too)
>> ramdiskadm -a tempdisk 256m
>> zpool create temppool /dev/ramdisk/tempdisk
>> zfs set sharenfs=rw=$CLIENT,root=$CLIENT temppool
>>
>>
>> # On CLIENT:
>>
>> # I used Debian Squeeze (6.0.3), but I expect other distros will work as well.
>> # uname -a : Linux basil 2.6.32-5-xen-amd64 #1 SMP Mon Oct 3 07:53:54 UTC 2011 x86_64 GNU/Linux
>> # dpkg -l nfs-common : nfs-common 1:1.2.2-4 NFS support files common to client and server
>>
>> # Get filebench, either from distro, or download:
>> # http://sourceforge.net/projects/filebench/files/filebench/filebench-1.4.9.1
>> # untar; ./configure && make && make install
>>
>> mkdir /mnt/temppool
>> mount -t nfs4 -o rw,sync,hard $SERVER:/temppool /mnt/temppool
>> # Check that filesystem is writeable
>> touch /mnt/temppool/foo
>>
>> # Save nfs4test.f (below) to a file
>>
>> for i in $(seq 15); do echo ===== $i $(date); filebench -f nfs4test.f; done
>> # The NFS4 mount should become unresponsive around iteration 8 or 9...
>>
>>
>> ############################################################
>> # nfs4test.f (Filebench workload)
>> ############################################################
>>
>> set $dir=/mnt/temppool
>> set $nfiles=128k
>> set $filesize=1k
>>
>> define fileset name=nfs4test,path=$dir,size=$filesize,entries=$nfiles,filesizegamma=0,dirwidth=1000,prealloc,reuse
>>
>> define process name=dummy,instances=1
>> {
>>   thread name=dummy,memsize=1m,instances=1
>>   {
>>     flowop finishoncount name=finishoncount,value=0
>>   }
>> }
>>
>> set mode quit firstdone
>> run
>>
>> ############################################################
>>
>>
>
> This is not a known issue for us, but might be related to resource
> limitations in the NFSv4 server. Feel free to file a new bug. Can you run
>
>   echo -e '::rfs4_db\n::rfs_client' | mdb -k
>
> before and after the test?
>
> -Albert

Sorry, I meant

  echo -e '::rfs4_db\n::rfs4_client' | mdb -k

-Albert

-------------------------------------------
illumos-discuss
Archives: https://www.listbox.com/member/archive/182180/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182180/21175430-2e6923be
Modify Your Subscription: https://www.listbox.com/member/?member_id=21175430&id_secret=21175430-6a77cda4
Powered by Listbox: http://www.listbox.com
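
The before/after comparison requested above could be scripted roughly as follows. This is only a sketch, not something from the thread: the `snapshot` and `compare` helpers, the `OUTDIR` location, and the overridable `MDB` variable are all hypothetical names introduced for illustration. The only parts taken from the message are the two dcmds and `mdb -k`. `printf` is used in place of `echo -e` because it behaves the same across shells.

```shell
#!/bin/sh
# Sketch: save the output of the two mdb dcmds before and after the
# filebench run, then diff the two snapshots.
# MDB and OUTDIR are illustrative assumptions, not part of the original mail.
MDB=${MDB:-mdb -k}                      # command that reads dcmds on stdin
OUTDIR=${OUTDIR:-/var/tmp/nfs4debug}    # hypothetical output directory

snapshot() {
    # $1 is a label such as "before" or "after"
    mkdir -p "$OUTDIR" || return 1
    # $MDB is left unquoted on purpose so "mdb -k" splits into command + flag
    printf '::rfs4_db\n::rfs4_client\n' | $MDB > "$OUTDIR/rfs4-$1.txt"
}

compare() {
    # diff exits non-zero when the snapshots differ, which is the
    # interesting case here
    diff -u "$OUTDIR/rfs4-before.txt" "$OUTDIR/rfs4-after.txt"
}
```

On the server, one would presumably run `snapshot before` as root, start the filebench loop on the client, wait for the mount to hang, then run `snapshot after` and `compare`, and attach both files to the bug report.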
