On Thu, Dec 22, 2011 at 6:59 AM, Kasper Brink <[email protected]> wrote:
> Hello all,
>
> I'm testing a new fileserver running OI 151a, and I've run into a problem
> with an NFS4-mounted filesystem, on a Linux client, that stops responding.
> This happens after running a filebench workload on the client for several
> minutes. Metadata operations (ls, stat, rm, mkdir) still work, but anything
> that involves file contents (e.g. cat) blocks indefinitely. To get out of
> this state requires restarting the nfs service on the server (and waiting 2
> minutes for recovery). The good news is that this problem is reproducible;
> details below.
>
> When the problem occurs, snoop shows the following being repeated:
>
> 0.00009 basil.cs.ru.nl -> thyme.cs.ru.nl NFS C 4 () PUTFH FH=8775 SAVEFH OPEN 00000614 OT=NC SQ=0 CT=N AC=RW DN=N OO=3271 GETFH GETATTR 10011a 30a23a...
> 0.00022 thyme.cs.ru.nl -> basil.cs.ru.nl NFS R 4 () NFS4ERR_STALE_CLIENTID PUTFH NFS4_OK SAVEFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
> 0.00012 basil.cs.ru.nl -> thyme.cs.ru.nl NFS C 4 () RENEW CL=654ee52ddc
> 0.00002 thyme.cs.ru.nl -> basil.cs.ru.nl NFS R 4 () NFS4_OK RENEW NFS4_OK
> 0.00008 basil.cs.ru.nl -> thyme.cs.ru.nl NFS C 4 () PUTFH FH=8775 SAVEFH OPEN 00000614 OT=NC SQ=0 CT=N AC=RW DN=N OO=3208 GETFH GETATTR 10011a 30a23a...
> 0.00019 thyme.cs.ru.nl -> basil.cs.ru.nl NFS R 4 () NFS4ERR_STALE_CLIENTID PUTFH NFS4_OK SAVEFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
> 0.00012 basil.cs.ru.nl -> thyme.cs.ru.nl NFS C 4 () RENEW CL=654ee52ddc
> 0.00002 thyme.cs.ru.nl -> basil.cs.ru.nl NFS R 4 () NFS4_OK RENEW NFS4_OK
> ... and so on.
>
> So the client is getting an NFS4ERR_STALE_CLIENTID on OPEN, it successfully
> renews its clientid, and then immediately gets the same error again. This
> seems to be the same problem as described in
> http://thread.gmane.org/gmane.linux.nfs/44449 ; the conclusion of that
> thread was that this is not a bug in the Linux client.
>
> The filebench workload that reproduces this behaviour is just the
> creation/reuse of a fileset with many small files (no I/O flowops are
> needed). The problem always seems to occur after about 1 million NFS4 OPEN ops.
> The same workload runs without issues when the client is OI 151a, or the
> server is Linux, or over NFS3. I don't think it is hardware related,
> because I get the same behaviour with Xen PV domains.
>
> Is this a known problem, or should I report it as a bug? Is there anything
> else I can do to help debug this?
>
> Regards,
>
> Kasper Brink
>
>
>
>     Steps to reproduce
>     ==================
>
> # On SERVER:
>
> # (Ramdisk-based pool is fastest, but disk-based works too)
> ramdiskadm -a tempdisk 256m
> zpool create temppool /dev/ramdisk/tempdisk
> zfs set sharenfs=rw=$CLIENT,root=$CLIENT temppool
>
>
> # On CLIENT:
>
> # I used Debian Squeeze (6.0.3), but I expect other distros will work as well.
> # uname -a : Linux basil 2.6.32-5-xen-amd64 #1 SMP Mon Oct 3 07:53:54 UTC 2011 x86_64 GNU/Linux
> # dpkg -l nfs-common :  nfs-common  1:1.2.2-4  NFS support files common to client and server
>
> # Get filebench, either from distro, or download:
> #  http://sourceforge.net/projects/filebench/files/filebench/filebench-1.4.9.1
> #  untar; ./configure && make && make install
>
> mkdir /mnt/temppool
> mount -t nfs4 -o rw,sync,hard $SERVER:/temppool /mnt/temppool
> # Check that filesystem is writeable
> touch /mnt/temppool/foo
>
> # Save nfs4test.f (below) to a file
>
> for i in $(seq 15); do echo ===== $i $(date); filebench -f nfs4test.f; done
> # The NFS4 mount should become unresponsive around iteration 8 or 9...
>
>
> ############################################################
> # nfs4test.f   (Filebench workload)
> ############################################################
>
> set $dir=/mnt/temppool
> set $nfiles=128k
> set $filesize=1k
>
> define fileset name=nfs4test,path=$dir,size=$filesize,entries=$nfiles,
>               filesizegamma=0,dirwidth=1000,prealloc,reuse
>
> define process name=dummy,instances=1
> {
>  thread name=dummy,memsize=1m,instances=1
>  {
>    flowop finishoncount name=finishoncount,value=0
>  }
> }
>
> set mode quit firstdone
> run
>
> ############################################################
>
>
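
For reference, the figures in the report (128k files, hang around iteration
8 or 9, about 1 million OPENs) can be lined up with a rough count. This is
only a sketch under two assumptions that are not stated in the original post:
that filebench reads "128k" as 128*1024, and that each run issues roughly one
NFSv4 OPEN per file in the fileset. The function name cumulative_opens is
made up for illustration.

```shell
#!/bin/sh
# Rough estimate of cumulative NFSv4 OPENs after N filebench runs.
# ASSUMPTIONS (not from the original post): "128k" means 128*1024 files,
# and each run issues about one OPEN per file in the fileset.

NFILES=$((128 * 1024))   # "set $nfiles=128k" in nfs4test.f

# cumulative_opens N: print the estimated OPEN count after N iterations
cumulative_opens() {
    echo $((NFILES * $1))
}

i=1
while [ "$i" -le 15 ]; do
    echo "iteration $i: ~$(cumulative_opens "$i") cumulative OPENs"
    i=$((i + 1))
done
```

Under those assumptions the count reaches 1048576 at iteration 8, which is
consistent with the "iteration 8 or 9" observation. Whether the server has a
limit near that value is not established anywhere in this thread.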

This is not a known issue for us, but it might be related to resource
limitations in the NFSv4 server. Feel free to file a new bug. Can you run

  echo -e '::rfs4_db\n::rfs_client' | mdb -k

on the server before and after the test?
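
One possible way to collect the two snapshots so they can be compared is
sketched below. It pipes the same two dcmds into mdb -k and saves the output
to a file. The function name capture_nfs4_state, the MDB_CMD override and the
/var/tmp file names are all hypothetical; printf is used instead of echo -e
only because its handling of \n is more predictable across shells.

```shell
#!/bin/sh
# Sketch: save NFSv4 server state from mdb before and after the test.
# MDB_CMD is a hypothetical override so the function can be tried without
# a live kernel debugger; by default it runs "mdb -k" as in the request.
MDB_CMD=${MDB_CMD:-"mdb -k"}

# capture_nfs4_state FILE: write the output of the two dcmds to FILE
capture_nfs4_state() {
    # $MDB_CMD is left unquoted on purpose so "mdb -k" splits into words
    printf '::rfs4_db\n::rfs_client\n' | $MDB_CMD > "$1"
}

# Intended use on the server, as root (file names are examples only):
#   capture_nfs4_state /var/tmp/nfs4-before.txt
#   ... run the filebench loop on the client until the mount hangs ...
#   capture_nfs4_state /var/tmp/nfs4-after.txt
#   diff /var/tmp/nfs4-before.txt /var/tmp/nfs4-after.txt
```

Attaching both files, or the diff, to the bug report should make it easier to
see what changed on the server during the run.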

-Albert


-------------------------------------------
illumos-discuss