OpenSolaris 2009.06 (snv_111b) on BOTH client and server.

We have nested mirror-mounted NFSv4 file systems over 1 Gb/s TCP. Running only 30 jobs on a single host (X4600M2) via SGE intermittently (30% of all jobs, sometimes even more!) gives this shepherd error:

    can't stat() "/home/Processor/xxx/yyy/zzz/..../aaa.log" as stdout_path: Device busy KRB5CCNAME=none ....

We have not seen this without mirrormounts. Could there be a single-threaded bottleneck that triggers this error? (I have seen a similar problem with ramdiskadm: you can't create more than one ramdisk at a time with two parallel commands; it also gives 'device busy'.)

The server is completely idle (an X4540 with 8 x (5 Raid5) ZFS stripe, capable of 6000 IOPS, 650 MB/s measured write throughput), and there are no errors in nfsstat -s. nfsstat -c on the client host shows some badcalls, badxids and timeouts:

    Client rpc:
    Connection oriented:
    calls      badcalls   badxids    timeouts   newcreds   badverfs   timers
    152269965  1376       34         87         0          0          0

    Client nfs:
    calls      badcalls   clgets     cltoomany
    152268831  1255       152268765  69

Already tuned, but it did not help:

    ncsize=0x100000
    nfs:nfs4_bsize=0x100000
    tcp_xmit_hiwat, tcp_recv_hiwat=1024000
    NFSD_LISTEN_BACKLOG=600
    NFSD_SERVERS=600
    LOCKD_LISTEN_BACKLOG=600
    NFS_SERVER_DELEGATION=off

We have been trying to run NFSv4 for over a year now, and no OS version has been able to deliver without severe problems (see, e.g., my earlier post <http://www.opensolaris.org/jive/thread.jspa?threadID=103920&tstart=45>, "data corruption with NFS4/ZFS"; that symptom is now worse with both server and client on OSol). Will NFSv4 be a sandbox for developers forever, or is it planned to make it usable for enterprise production some day? At the very least, there should be a statement in the release notes that NFSv4 is not yet usable for production because of multiple issues. Do we have to go back to v3 again and wait until v5 gives a better experience? Or is NFS development effectively dead in OpenSolaris because OSol now concentrates on the "typical desktop user" who does not need NFS?

No, we do not plan to step back to Solaris 10 ("./configure; => Error: at least version xxx of package yyy needed ..."
=> hopelessly outdated software versions for our needs!), and we really need v4 with mirrormount capability for our complex and dynamic directory layouts (on v3 it is too much work to keep the automount tables up to date, and performance is bad for ~300 clients).

Sorry for ranting, but I get really upset because every new release of SXDE, OpenSolaris or whatever takes us another step back instead of forward. We are not developing OSol, but we need to work with it! You should really concentrate on consolidating code instead of incorporating more and more new stuff that does not really work when it comes to real production. There are so many open bugs in this area.

automount entries:

    auto_master:
    /home  /imksun/auto_home_SunOS  -tcp,rw,intr,noquota,actimeo=1,bg

    auto_home_SunOS:
    Processor  -fstype=autofs  auto_Processor_SunOS

    auto_Processor_SunOS:
    xxx  server3:/Work_Pool/&

The mount options are (nfsstat -m):

    /home/Processor/xxx from server3:/Work_Pool/xxx
     Flags: vers=4,proto=tcp,sec=sys,hard,intr,link,symlink,acl,rsize=1048576,wsize=1048576,retrans=5,timeo=600
     Attr cache: acregmin=1,acregmax=1,acdirmin=1,acdirmax=1

    /home/Processor/xxx/yyy from server3:/Work_Pool/xxx/yyy
     Flags: vers=4,proto=tcp,sec=sys,hard,intr,link,symlink,acl,mirrormount,rsize=1048576,wsize=1048576,retrans=5,timeo=600
     Attr cache: acregmin=1,acregmax=1,acdirmin=1,acdirmax=1

    /home/Processor/xxx/yyy/zzz from imksunth3:/Work_Pool/xxx/yyy/zzz
     Flags: vers=4,proto=tcp,sec=sys,hard,intr,link,symlink,acl,mirrormount,rsize=1048576,wsize=1048576,retrans=5,timeo=600
     Attr cache: acregmin=1,acregmax=1,acdirmin=1,acdirmax=1

As seen from mount (note that it does not show tcp etc. for the mirror mounts!):

    /home/Processor/xxx on server3:/Work_Pool/xxx remote/read/write/setuid/devices/tcp/intr/noquota/actimeo=1/bg/xattr/dev=5303212 on Mon Aug 17 14:18:47 2009
    /home/Processor/xxx/yyy on server3:/Work_Pool/xxx/yyy remote/read/write/setuid/devices/xattr/dev=5303213 on Mon Aug 17 14:18:56 2009
    /home/Processor/xxx/yyy/zzz on server3:/Work_Pool/xxx/yyy/zzz remote/read/write/setuid/devices/xattr/dev=5303214 on Mon Aug 17 14:18:56 2009

-- This message posted from opensolaris.org