Pavel Filipensky wrote:
> Hi Udo,
>
> how long have you been on vanilla Osol 2009.06 and how long on SRU u6
> before updating to SRU u7? Is it possible that SRU u6 has the issue as
> well?
We apparently switched to SU5 on 21-Oct-2009, to SU6 on 17-Dec-2009, and then to SU7+IDR30+IDR35 on 14-Jan-2010; after that, these problems started to occur about once every 1-2 days. We had unexplained hangups before, on Mondays, but with a much lower frequency, and they were probably related to zfs scrubs, which triggered swapping via the ARC (which in turn was buggy as well...).

I now suspect that NFS may be right to deny access with NFS4ERR_NO_GRACE, since I found a DNS query related to nfs4v_mapid (we have not set it; all NFS mapping should be done on sys locally via /etc/hosts). Our local machines on the private net have access to DNS through NAT (no ipf.conf entries), which seems ok, but a few machines with private AND public network access have an additional (unwanted) route on the private net to the public net via this NAT. This sometimes seems to result (I don't know why) in a DNS query about the imksunxxx machine (which is a local entry in /etc/hosts on the private net). The DNS query would return the short name resolved to imksunxxx.ourdomain.tld with a public IP, and, voila, NFS sees a different client under the same short name and must deny access due to the ambiguity (but I suspect that this should be visible in the snoop as a FQHN?). Alternatively, the client is visible both through NAT and through the private net, and this could trigger the problem.

It's still inconclusive to me. As a test I switched off NAT (we only need it for mail and updates), and the hangups have been gone for 3 days now (but we will see). Maybe we need ipf.conf entries which filter out the private net traffic to prevent additional routes (or even loops?).
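The suspected ambiguity (one short name, one address in /etc/hosts, a different address from DNS) could be checked with something like the following rough sketch. It is only an illustration, not a tested diagnostic: the function names and the addresses in the usage note are hypothetical, and how the DNS answers are collected (e.g. by querying the name server directly, bypassing /etc/hosts) is left outside the sketch.

```python
def parse_hosts(text):
    """Map every name and alias in hosts-file text to its set of addresses."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        addr, *names = line.split()
        for name in names:
            table.setdefault(name.lower(), set()).add(addr)
    return table


def conflicting_names(hosts_text, dns_answers):
    """Return names for which DNS gave addresses that /etc/hosts does not list.

    dns_answers is a dict {short_name: set_of_addresses} gathered separately
    (hypothetical input; this sketch performs no DNS queries itself).
    The result maps each conflicting name to (hosts_addresses, extra_dns_addresses).
    """
    conflicts = {}
    for name, local in parse_hosts(hosts_text).items():
        extra = set(dns_answers.get(name, ())) - local
        if extra:
            conflicts[name] = (local, extra)
    return conflicts
```

For example, with a hosts entry `192.168.1.10 imksunxxx` and a DNS answer of `203.0.113.7` for `imksunxxx` (both addresses made up), `conflicting_names` would report `imksunxxx` as ambiguous; if DNS returns the same address as the hosts file, nothing is reported.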
> I was suspicious about one integration which went to SRUu2 and to S10U8
> and to snv_114, but the change is only on the nfs client side - this
> does not match the set-up described earlier by Jorgen:
> - clients are s10u5
> - server/s10u8 (issue)
> - server/snv_117 (no issue)
>
> I am not able to find out more from the available data - unless the
> problem is reproducible it is hard to diagnose.
> I have made one observation (not sure if it is useful)
>
> reopen with CLAIM_PREVIOUS (CT=P) fails with NFS4ERR_NO_GRACE,
> but reopen with CLAIM_NULL (CT=N) succeeds with NFS4_OK
>
> Pavel
>
> On 03/11/10 14:16, Udo Grabowski wrote:
>> Hi Pavel,
>>
>> both clients and server were updated (we always have a consistent
>> environment), and we came from u6, to which we updated directly from
>> a vanilla Osol 2009.06 before. The Readmes don't list any NFS patches
>> there, so I suspect that our IDR30 patch carries unwanted changes from
>> Solaris u8 into Opensolaris U7 which trigger this problem. Since we
>> don't use Solaris 10, I cannot confirm that 10u7 did not have that
>> problem, I just concluded it from the initial post here (but that
>> conclusion may be wrong, I admit).
>> We currently snoop the problem and caught some clues (maybe): Shortly
>> before, we see CB_NULL and NULL4 exchanges, seemingly as a result of a
>> client renewal (we do not catch everything before), both server and
>> client seem to check their partner's callback capabilities:
>> .....

-- 
Dr.Udo Grabowski Inst.f.Meteorology a.Climate Research IMK-ASF-SAT
www-imk.fzk.de/asf/sat/grabowski/ www.imk-asf.kit.edu/english/sat.php
KIT - Karlsruhe Institute of Technology http://www.kit.edu
Postfach 3640,76021 Karlsruhe,Germany Tel:(+49)7247 82-6026,Fax:-7026