Pavel Filipensky wrote:
> Hi Udo,
> 
> how long have you been on vanilla Osol 2009.06 and how long on SRU u6 
> before updating to SRU u7? Is it possible that SRU u6 has the issue as 
> well?

We apparently switched to SU5 on 21-Oct-2009, to SU6 on 17-Dec-2009,
and then to SU7+IDR30+IDR35 on 14-Jan-2010. After that, these problems
started to occur about once every 1-2 days. We had unexplained hangups
before, on Mondays, but at a much lower frequency, and those were probably
related to zfs scrubs, which triggered swapping (via the ARC), which in
turn was buggy as well.

I now suspect that NFS may be right to deny access with
NFS4ERR_NO_GRACE. I found a DNS query related to nfs4v_mapid (we have
not set it; all NFS mapping should be done on sys locally via /etc/hosts).
Our local machines on the private net reach DNS through NAT (no ipf.conf
entries), which seems ok, but a few machines with private AND public
network access have an additional (unwanted) route from the private net
to the public net via this NAT. This sometimes seems to result (I don't
know why) in a DNS query for the imksunxxx machine (which is a local
entry in /etc/hosts on the private net). That query would return the
short name resolved to imksunxxx.ourdomain.tld with a public IP, and,
voila, NFS sees a different client under the same short name and must deny
access due to the ambiguity (but I suspect this should be visible in the
snoop as a FQHN?). Alternatively, the client is visible both through
NAT and on the private net, and this could trigger the problem. It's still
inconclusive to me.
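
To make the suspected ambiguity concrete, here is a minimal sketch of a
check that compares what /etc/hosts says about a short name with what a
resolver lookup returns. It is only an illustration under assumptions: the
function names are made up, and 192.168.1.10 and 192.0.2.10 are placeholder
addresses standing in for the private and public IPs of imksunxxx. Note
that socket.getaddrinfo follows the system's name-service order, so it
shows what applications see, which is not necessarily a raw DNS answer.

```python
import socket


def parse_hosts(text):
    """Map every name and alias in hosts-file text to its set of addresses."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        addr, *names = line.split()
        for name in names:
            table.setdefault(name.lower(), set()).add(addr)
    return table


def conflicting_addresses(name, hosts_table, resolved):
    """Return resolver answers for `name` that the hosts file does not list.

    An empty list means no conflict, or that the name has no local entry.
    """
    local = hosts_table.get(name.lower(), set())
    if not local:
        return []
    return sorted(set(resolved) - local)


def resolver_addresses(name):
    """Ask the system resolver for the IPv4 addresses of `name`."""
    infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Placeholder data; on a real machine, read /etc/hosts and call
    # resolver_addresses("imksunxxx") instead.
    hosts = parse_hosts("192.168.1.10 imksunxxx  # private net entry\n")
    answers = ["192.168.1.10", "192.0.2.10"]
    print(conflicting_addresses("imksunxxx", hosts, answers))
```

Any address reported here would be one under which the server could see
the "same" client a second time, which is the situation I suspect above.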

As a test I switched off NAT (we only need it for mail and updates), and
the hangups have been gone for 3 days now (but we will see). Maybe we need
ipf.conf entries which filter out the private net traffic to prevent
additional routes (or even loops?).

> I was suspicious about one integration which went to SRUu2 and to S10U8 
> and to snv_114, but the change is only on the nfs client side - this 
> does not match the set-up described earlier by Jorgen:
> - clients are s10u5
> - server/s10u8 (issue)
> - server/snv_117 (no issue)
> 
> I am not able to find out more from the available data - unless the 
> problem is reproducible it is hard to diagnose.
> I have made one observation (not sure if it is useful)
> 
> reopen with CLAIM_PREVIOUS (CT=P) fails with NFS4ERR_NO_GRACE,
> but reopen with CLAIM_NULL (CT=N) succeeds with NFS4_OK
> 
> Pavel
> 
> On 03/11/10 14:16, Udo Grabowski wrote:
>> Hi Pavel,
>>
>> both clients and server were updated (we always have a consistent
>> environment), and we came from u6, to which we updated directly from
>> a vanilla Osol 2009.06 before. The Readmes don't list any NFS patches
>> there, so I suspect that our IDR30 patch carries unwanted changes from
>> Solaris u8 into Opensolaris U7 which trigger this problem. Since we
>> don't use Solaris 10, I cannot confirm that 10u7 did not have that
>> problem, I just concluded it from the initial post here (but that
>> conclusion may be wrong, I admit).
>> We currently snoop the problem and caught some clues (maybe): Shortly
>> before, we see CB_NULL and NULL4 exchanges, seemingly as a result of a
>> client renewal (we do not catch everything before), both server and
>> client seem to check their partner's callback capabilities:
>> .....
-- 
Dr.Udo Grabowski    Inst.f.Meteorology a.Climate Research IMK-ASF-SAT
www-imk.fzk.de/asf/sat/grabowski/ www.imk-asf.kit.edu/english/sat.php
KIT - Karlsruhe Institute of Technology            http://www.kit.edu
Postfach 3640,76021 Karlsruhe,Germany Tel:(+49)7247 82-6026,Fax:-7026
