On Thu, Dec 22, 2011 at 9:58 AM, Albert Lee <[email protected]> wrote:
> On Thu, Dec 22, 2011 at 6:59 AM, Kasper Brink <[email protected]> wrote:
>> Hello all,
>>
>> I'm testing a new fileserver running OI 151a, and I've run into a problem
>> with an NFS4-mounted filesystem, on a Linux client, that stops responding.
>> This happens after running a filebench workload on the client for several
>> minutes. Metadata operations (ls, stat, rm, mkdir) still work, but anything
>> that involves file contents (e.g. cat) blocks indefinitely. To get out of
>> this state requires restarting the nfs service on the server (and waiting 2
>> minutes for recovery). The good news is that this problem is reproducible;
>> details below.
>>
>> When the problem occurs, snoop shows the following being repeated:
>>
>> 0.00009 basil.cs.ru.nl -> thyme.cs.ru.nl NFS C 4 () PUTFH FH=8775 SAVEFH OPEN 00000614 OT=NC SQ=0 CT=N AC=RW DN=N OO=3271 GETFH GETATTR 10011a 30a23a...
>> 0.00022 thyme.cs.ru.nl -> basil.cs.ru.nl NFS R 4 () NFS4ERR_STALE_CLIENTID PUTFH NFS4_OK SAVEFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
>> 0.00012 basil.cs.ru.nl -> thyme.cs.ru.nl NFS C 4 () RENEW CL=654ee52ddc
>> 0.00002 thyme.cs.ru.nl -> basil.cs.ru.nl NFS R 4 () NFS4_OK RENEW NFS4_OK
>> 0.00008 basil.cs.ru.nl -> thyme.cs.ru.nl NFS C 4 () PUTFH FH=8775 SAVEFH OPEN 00000614 OT=NC SQ=0 CT=N AC=RW DN=N OO=3208 GETFH GETATTR 10011a 30a23a...
>> 0.00019 thyme.cs.ru.nl -> basil.cs.ru.nl NFS R 4 () NFS4ERR_STALE_CLIENTID PUTFH NFS4_OK SAVEFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
>> 0.00012 basil.cs.ru.nl -> thyme.cs.ru.nl NFS C 4 () RENEW CL=654ee52ddc
>> 0.00002 thyme.cs.ru.nl -> basil.cs.ru.nl NFS R 4 () NFS4_OK RENEW NFS4_OK
>> ... and so on.
>>
>> So the client is getting an NFS4ERR_STALE_CLIENTID on OPEN; it successfully
>> renews its clientid, and then immediately gets the same error again. This
>> seems to be the same problem as described in
>> http://thread.gmane.org/gmane.linux.nfs/44449 ; the conclusion of that
>> thread was that this is not a bug in the Linux client.
>>
>> The filebench workload that reproduces this behaviour is just the
>> creation/reuse of a fileset with many small files (no I/O flowops are
>> needed). The problem always seems to occur after about 1 million NFS4 OPEN ops.
>> The same workload runs without issues when the client is OI 151a, or the
>> server is Linux, or over NFS3. I don't think it is hardware related,
>> because I get the same behaviour with Xen PV domains.
>>
>> Is this a known problem, or should I report it as a bug? Is there anything
>> else I can do to help debug this?
>>
>> Regards,
>>
>> Kasper Brink
>>
>>
>>
>>     Steps to reproduce
>>     ==================
>>
>> # On SERVER:
>>
>> # (Ramdisk-based pool is fastest, but disk-based works too)
>> ramdiskadm -a tempdisk 256m
>> zpool create temppool /dev/ramdisk/tempdisk
>> zfs set sharenfs=rw=$CLIENT,root=$CLIENT temppool
>>
>>
>> # On CLIENT:
>>
>> # I used Debian Squeeze (6.0.3), but I expect other distros will work as well.
>> # uname -a : Linux basil 2.6.32-5-xen-amd64 #1 SMP Mon Oct 3 07:53:54 UTC 2011 x86_64 GNU/Linux
>> # dpkg -l nfs-common :  nfs-common  1:1.2.2-4  NFS support files common to client and server
>>
>> # Get filebench, either from distro, or download:
>> #  http://sourceforge.net/projects/filebench/files/filebench/filebench-1.4.9.1
>> #  untar; ./configure && make && make install
>>
>> mkdir /mnt/temppool
>> mount -t nfs4 -o rw,sync,hard $SERVER:/temppool /mnt/temppool
>> # Check that filesystem is writeable
>> touch /mnt/temppool/foo
>>
>> # Save nfs4test.f (below) to a file
>>
>> for i in $(seq 15); do echo ===== $i $(date); filebench -f nfs4test.f; done
>> # The NFS4 mount should become unresponsive around iteration 8 or 9...
>>
>>
>> ############################################################
>> # nfs4test.f   (Filebench workload)
>> ############################################################
>>
>> set $dir=/mnt/temppool
>> set $nfiles=128k
>> set $filesize=1k
>>
>> define fileset name=nfs4test,path=$dir,size=$filesize,entries=$nfiles,
>>               filesizegamma=0,dirwidth=1000,prealloc,reuse
>>
>> define process name=dummy,instances=1
>> {
>>  thread name=dummy,memsize=1m,instances=1
>>  {
>>    flowop finishoncount name=finishoncount,value=0
>>  }
>> }
>>
>> set mode quit firstdone
>> run
>>
>> ############################################################
>>
>>
>
> This is not a known issue for us, but might be related to resource
> limitations in the NFSv4 server. Feel free to file a new bug. Can you
> echo -e '::rfs4_db\n::rfs_client' | mdb -k before and after the test?
>
> -Albert

Sorry, I meant echo -e '::rfs4_db\n::rfs4_client' | mdb -k
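
A minimal sketch of capturing that output before and after the test so the
two can be compared. Only "mdb -k" and the ::rfs4_db and ::rfs4_client dcmds
come from the command above; the function name rfs4_snapshot, the MDB
override variable, and the file names are illustrative assumptions. printf
is used in place of "echo -e" because its handling of \n does not vary
between shells.

```shell
# Hypothetical helper: dump the NFSv4 server state tables to a file.
# MDB may be overridden (for example for a dry run); it defaults to mdb.
MDB=${MDB:-mdb}

rfs4_snapshot() {
    # $1 = file that receives the dcmd output
    printf '::rfs4_db\n::rfs4_client\n' | "$MDB" -k > "$1"
}
```

Possible usage on the server, as root:

    rfs4_snapshot /tmp/rfs4.before
    (run the filebench loop on the client until the mount hangs)
    rfs4_snapshot /tmp/rfs4.after
    diff /tmp/rfs4.before /tmp/rfs4.after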

-Albert
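
To confirm from a capture that a client is stuck in the OPEN/RENEW loop
shown in the quoted snoop trace, the failing replies can be counted. This is
a rough sketch that assumes the snoop summary output was saved to a text
file; the function name count_stale_open and the file name in the usage line
are hypothetical.

```shell
# Hypothetical helper: count lines in which an OPEN was answered
# with NFS4ERR_STALE_CLIENTID.
count_stale_open() {
    # $1 = text file containing snoop summary lines
    # grep -c prints the number of matching lines; it exits with status 1
    # when that number is 0, which is treated as success here.
    grep -c 'OPEN NFS4ERR_STALE_CLIENTID' "$1" || [ $? -eq 1 ]
}
```

Possible usage:

    count_stale_open /tmp/snoop.txt

A count that keeps growing while the mount is hung would match the behaviour
described in the original report.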


-------------------------------------------
illumos-discuss
