OpenSolaris 2009.06 (snv_111b) on BOTH client and server.
We have nested NFSv4 mirror mounts over 1 Gb/s TCP. Running only 30 jobs on a
single host (X4600 M2) via SGE intermittently (for 30% of all jobs, sometimes
even more!) gives this shepherd error:

can't stat() "/home/Processor/xxx/yyy/zzz/..../aaa.log" as stdout_path: Device busy KRB5CCNAME=none ....

We've not seen this without mirror mounts. Could there be a single-threaded
bottleneck that triggers this error? (I've seen a similar problem with
ramdiskadm: you can't create more than one ramdisk at a time with two
parallel commands; it also gives 'device busy'.)
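
To test the single-threaded-bottleneck guess outside of SGE, a small probe
along the following lines might help. It is only a sketch: the function names
are made up for illustration, and it assumes the failure appears when many
callers stat() a path below a mirror mount that has not been triggered yet, so
it would have to be run before anything else touches that subtree.

```python
import errno
import os
from concurrent.futures import ThreadPoolExecutor


def probe(path):
    """Return the errno of a failed os.stat(path), or 0 on success."""
    try:
        os.stat(path)
        return 0
    except OSError as exc:
        return exc.errno


def count_errors(path, workers=30, rounds=10):
    """stat() `path` from `workers` threads at once, `rounds` times.

    Returns a dict mapping errno name -> number of failed calls.
    """
    failures = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(rounds):
            # All workers hit the same path in the same round.
            for code in pool.map(probe, [path] * workers):
                if code:
                    name = errno.errorcode.get(code, str(code))
                    failures[name] = failures.get(name, 0) + 1
    return failures
```

Calling count_errors() on a log path such as the one in the shepherd message
and finding an "EBUSY" key in the result would point at the mount trigger
rather than at SGE itself.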

The server is completely idle (an X4540 with an 8 x (5 Raid5) ZFS stripe,
capable of 6000 IOPS, 650 MB/s measured write throughput), and nfsstat -s
shows no errors. nfsstat -c on the client host shows some badcalls, badxids
and timeouts:

Client rpc:
Connection oriented:
calls      badcalls   badxids    timeouts   newcreds   badverfs   timers     
152269965  1376       34         87         0          0          0          
Client nfs:
calls     badcalls  clgets    cltoomany 
152268831 1255      152268765 69        
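
To put these counters in proportion, the rates per call can be worked out
from the numbers above. This is just arithmetic on the pasted output (the
helper names are made up); the rates come out at roughly one in 100,000
calls or less, which suggests the counters alone do not explain a 30% job
failure rate.

```python
def parse_counters(header, values):
    """Pair an nfsstat header line with the value line below it."""
    return dict(zip(header.split(), (int(v) for v in values.split())))


def rate(counters, name):
    """Fraction of all calls that incremented the named counter."""
    return counters[name] / counters["calls"]


# Values copied from the nfsstat -c output above.
rpc = parse_counters(
    "calls badcalls badxids timeouts newcreds badverfs timers",
    "152269965 1376 34 87 0 0 0",
)
nfs = parse_counters(
    "calls badcalls clgets cltoomany",
    "152268831 1255 152268765 69",
)

for name in ("badcalls", "badxids", "timeouts"):
    print(f"rpc {name}: {rate(rpc, name):.2e} per call")
print(f"nfs badcalls: {rate(nfs, 'badcalls'):.2e} per call")
```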

Already tuned, but it did not help:
ncsize=0x100000 
nfs:nfs4_bsize=0x100000
tcp_xmit_hiwat=1024000
tcp_recv_hiwat=1024000
NFSD_LISTEN_BACKLOG=600
NFSD_SERVERS=600
LOCKD_LISTEN_BACKLOG=600
NFS_SERVER_DELEGATION=off

We have been trying to run NFSv4 for over a year now, and no OS version has
been able to deliver without severe problems (see, e.g., my earlier post
<http://www.opensolaris.org/jive/thread.jspa?threadID=103920&tstart=45>,
"data corruption with NFS4/ZFS"; that symptom is now worse with both server
and client on OSol).
Will NFSv4 be a sandbox for developers forever, or is it planned to make it
usable for enterprise production some day? At the very least, the release
notes should state that NFSv4 is not yet usable for production because of
multiple issues. Do we have to go back to v3 again and wait until v5 gives a
better experience? Or is NFS development effectively dead in OpenSolaris
because OSol now concentrates on the "typical desktop user" who does not
need NFS?
No, we do not plan to step back to Solaris 10 ("./configure; => Error: at
least version xxx of package yyy needed ..." => hopelessly outdated software
versions for our needs!), and we really need v4 with mirror mount capability
for our complex and dynamic directory layouts (on v3 it is too much work to
keep the automount tables up to date, and performance is bad for ~300
clients).

Sorry for ranting, but I get really upset because every new release of SXDE,
OpenSolaris or whatever takes us another step back instead of forward. We
are not developing OSol, but we need to work with it! You should really
concentrate on consolidating code instead of incorporating more and more new
stuff which does not really work when it comes to real production. There are
so many open bugs in this area.

Automount entries:
auto_master:
/home      /imksun/auto_home_SunOS        -tcp,rw,intr,noquota,actimeo=1,bg
auto_home_SunOS:
Processor       -fstype=autofs  auto_Processor_SunOS
auto_Processor_SunOS:
xxx           server3:/Work_Pool/&

The mount options (from nfsstat -m) are:
/home/Processor/xxx from server3:/Work_Pool/xxx
 Flags:         vers=4,proto=tcp,sec=sys,hard,intr,link,symlink,acl,rsize=1048576,wsize=1048576,retrans=5,timeo=600
 Attr cache:    acregmin=1,acregmax=1,acdirmin=1,acdirmax=1

/home/Processor/xxx/yyy from server3:/Work_Pool/xxx/yyy
 Flags:         vers=4,proto=tcp,sec=sys,hard,intr,link,symlink,acl,mirrormount,rsize=1048576,wsize=1048576,retrans=5,timeo=600
 Attr cache:    acregmin=1,acregmax=1,acdirmin=1,acdirmax=1

/home/Processor/xxx/yyy/zzz from imksunth3:/Work_Pool/xxx/yyy/zzz
 Flags:         vers=4,proto=tcp,sec=sys,hard,intr,link,symlink,acl,mirrormount,rsize=1048576,wsize=1048576,retrans=5,timeo=600
 Attr cache:    acregmin=1,acregmax=1,acdirmin=1,acdirmax=1

As seen from mount (note it does not show tcp/etc. for the mirror mounts!):
/home/Processor/xxx on server3:/Work_Pool/xxx remote/read/write/setuid/devices/tcp/intr/noquota/actimeo=1/bg/xattr/dev=5303212 on Mon Aug 17 14:18:47 2009
/home/Processor/xxx/yyy on server3:/Work_Pool/xxx/yyy remote/read/write/setuid/devices/xattr/dev=5303213 on Mon Aug 17 14:18:56 2009
/home/Processor/xxx/yyy/zzz on server3:/Work_Pool/xxx/yyy/zzz remote/read/write/setuid/devices/xattr/dev=5303214 on Mon Aug 17 14:18:56 2009
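
The difference between the automounted parent and a mirror mount in the
mount output can be listed mechanically. The sketch below uses a made-up
helper and assumes the one-line-per-mount layout of the output above. Note
that nfsstat -m still reports proto=tcp and the attribute cache settings for
all three mounts, so the difference may only be in what mount displays.

```python
def mount_options(line):
    """Split one line of `mount` output into (mount_point, set of options)."""
    mount_point, rest = line.split(" on ", 1)
    # rest looks like: "<resource> <opt/opt/...> on <date>"
    _resource, opts = rest.split(" on ", 1)[0].split(None, 1)
    return mount_point, set(opts.split("/"))


def without_dev(opts):
    """Drop the per-mount dev= entry, which always differs."""
    return {o for o in opts if not o.startswith("dev=")}


# Lines copied from the mount output above.
parent = ("/home/Processor/xxx on server3:/Work_Pool/xxx "
          "remote/read/write/setuid/devices/tcp/intr/noquota/actimeo=1/bg/"
          "xattr/dev=5303212 on Mon Aug 17 14:18:47 2009")
child = ("/home/Processor/xxx/yyy on server3:/Work_Pool/xxx/yyy "
         "remote/read/write/setuid/devices/xattr/dev=5303213 "
         "on Mon Aug 17 14:18:56 2009")

_, parent_opts = mount_options(parent)
_, child_opts = mount_options(child)
missing = sorted(without_dev(parent_opts) - without_dev(child_opts))
print(missing)  # ['actimeo=1', 'bg', 'intr', 'noquota', 'tcp']
```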
-- 
This message posted from opensolaris.org
