Re: nfs lockd errors after NetApp software upgrade.

2020-01-09 Thread Daniel Braniss


> On 9 Jan 2020, at 05:24, Rick Macklem  wrote:
> 
> The attached patch changes the xid to be a global for all "connections" for
> the krpc UDP client.
> 
> You could try it if you'd like. It passed a trivial test, but I don't know what
> that "misfeature" comment means, so I don't know if this breaks that.
> 
> I can't think of why "xid" would have been per-connection (especially since a
> connection is a questionable concept for UDP), except that this might have
> originated in a userland library and carried into the kernel during porting.
> 
> rick


I will try it ASAP. In the meantime the new behavior of the NetApp has been
disabled, and since I still don’t know what is causing the unexplained huge
number of unlock requests, it’s going to be a long debug process.
Also, I will see how to switch to TCP for the NLM protocol with minor
disruption.

thanks,
danny

[…] 
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: nfs lockd errors after NetApp software upgrade.

2020-01-08 Thread Rick Macklem
The attached patch changes the xid to be a global for all "connections" for
the krpc UDP client.

You could try it if you'd like. It passed a trivial test, but I don't know what
that "misfeature" comment means, so I don't know if this breaks that.

I can't think of why "xid" would have been per-connection (especially since a
connection is a questionable concept for UDP), except that this might have
originated in a userland library and carried into the kernel during porting.

rick



Re: nfs lockd errors after NetApp software upgrade.

2020-01-08 Thread Rick Macklem
I hope you don't mind the top post, but...
Here's a snippet of code from the krpc (I wasn't the author):
    if (stat == RPC_TIMEDOUT) {
        /*
         * Check for async send misfeature for NLM
         * protocol.
         */
        if ((rc->rc_timeout.tv_sec == 0
            && rc->rc_timeout.tv_usec == 0)
            || (rc->rc_timeout.tv_sec == -1
            && utimeout.tv_sec == 0
            && utimeout.tv_usec == 0)) {
            CLNT_RELEASE(client);
            break;
        }
    }
This causes the xid to be reinitialized when a timeout occurs.
The reinitialization uses __RPC_GETXID(), which does an exclusive or of
    pid ^ time.sec ^ time.usec
so it shouldn't end up the same anyhow.
(Normally this initialization only occurs once, but because of the above, it
could happen multiple times for the NLM.) What does "async misfeature"
mean? I have no idea.
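
For illustration, the seed computation described above can be sketched as
follows. This is a small Python model, not the krpc code; the function name
initial_xid and the explicit 32-bit mask are assumptions made for the sketch.

```python
def initial_xid(pid: int, tv_sec: int, tv_usec: int) -> int:
    """Model of an XID seed: pid ^ seconds ^ microseconds, kept to 32 bits.

    The name and the explicit 32-bit mask are assumptions for this sketch.
    """
    return (pid ^ tv_sec ^ tv_usec) & 0xFFFFFFFF


# Two seeds taken at different times normally differ ...
a = initial_xid(pid=1, tv_sec=2, tv_usec=0)   # 1 ^ 2 ^ 0
b = initial_xid(pid=1, tv_sec=2, tv_usec=1)   # 1 ^ 2 ^ 1
# ... but XOR alone does not guarantee uniqueness.
c = initial_xid(pid=1, tv_sec=3, tv_usec=0)   # 1 ^ 3 ^ 0
```

In this toy run a and b differ while b and c coincide, so distinct times
usually, but not necessarily, give distinct seeds.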

If by "transaction id" they are referring to the svid in the lock RPC message,
I have no idea if it should be unique for lock ops on different files.
What does the spec. say? No idea, since there is no such thing.

Anyhow, using TCP will avoid the DRC and whatever the Netapp filer
thinks w.r.t. the uniqueness of this field.

rick



Re: nfs lockd errors after NetApp software upgrade.

2020-01-08 Thread Rick Macklem
Switching to TCP should avoid the DRC crap. (Most systems except FreeBSD only
do DRC for UDP.)

I assume that by "transaction ID", they are referring to the XID in the RPC
header. (I'll take a look at how it is maintained for UDP in the krpc. Btw,
although their code expecting it to change for a different RPC isn't
surprising, the xid's behaviour is "underspecified" in the Sun RPC RFC.)

rick



Re: nfs lockd errors after NetApp software upgrade.

2020-01-08 Thread Daniel Braniss
top posting NetAPP reply:
…
Here you can see transaction ID (0x5e15f77a) being used over port 886 and the 
NFS server successfully responds.
 
4480695  2020-01-08 12:20:54  132.65.116.111  132.65.60.56    NLM  0x5e15f77a (1578497914)  886
         V4 UNLOCK Call (Reply In 4480696) FH:0x54b075a0 svid:13629 pos:0-0
4480696  2020-01-08 12:20:54  132.65.60.56    132.65.116.111  NLM  0x5e15f77a (1578497914)  4045
         V4 UNLOCK Reply (Call In 4480695)
 
Here you see that 2 minutes later the client uses the same transaction ID 
(0x5e15f77a) and the same port again, but the file handle is different, so the 
client is unlocking a different file.
 
4591136  2020-01-08 12:22:54  132.65.116.111  132.65.60.56    NLM  0x5e15f77a (1578497914)  886
         [RPC retransmission of #4480695] V4 UNLOCK Call (Reply In 4480696) FH:0xb14b75a8 svid:13629 pos:0-0
4592588  2020-01-08 12:22:57  132.65.116.111  132.65.60.56    NLM  0x5e15f77a (1578497914)  886
         [RPC retransmission of #4480695] V4 UNLOCK Call (Reply In 4480696) FH:0xb14b75a8 svid:13629 pos:0-0
4598862  2020-01-08 12:23:03  132.65.116.111  132.65.60.56    NLM  0x5e15f77a (1578497914)  886
         [RPC retransmission of #4480695] V4 UNLOCK Call (Reply In 4480696) FH:0xb14b75a8 svid:13629 pos:0-0
4608871  2020-01-08 12:23:21  132.65.116.111  132.65.60.56    NLM  0x5e15f77a (1578497914)  886
         [RPC retransmission of #4480695] V4 UNLOCK Call (Reply In 4480696) FH:0xb14b75a8 svid:13629 pos:0-0
4635984  2020-01-08 12:23:59  132.65.116.111  132.65.60.56    NLM  0x5e15f77a (1578497914)  886
         [RPC retransmission of #4480695] V4 UNLOCK Call (Reply In 4480696) FH:0xb14b75a8 svid:13629 pos:0-0
 
transaction ID reuse is also seen for a number of other transaction IDs 
starting at the same time.
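
A check for this kind of reuse over calls already extracted from a capture
could look roughly like the sketch below. The tuple layout and the name
find_xid_reuse are hypothetical; the sample rows are the frames from the trace
above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (frame number, xid, client source port, file handle) for each NLM call
Call = Tuple[int, int, int, str]


def find_xid_reuse(calls: Iterable[Call]) -> Dict[Tuple[int, int], List[str]]:
    """Return {(xid, port): [file handles]} for XIDs seen with more than one handle."""
    handles = defaultdict(set)
    for _frame, xid, port, fh in calls:
        handles[(xid, port)].add(fh)
    return {key: sorted(fhs) for key, fhs in handles.items() if len(fhs) > 1}


# Calls taken from the trace above.
calls = [
    (4480695, 0x5E15F77A, 886, "0x54b075a0"),
    (4591136, 0x5E15F77A, 886, "0xb14b75a8"),
    (4592588, 0x5E15F77A, 886, "0xb14b75a8"),
]
reused = find_xid_reuse(calls)
```

With these rows, reused has a single entry for XID 0x5e15f77a on port 886
listing both file handles.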
 
Within ONTAP 9.3 we have changed the way our Replay-Cache tracks requests by 
including a checksum of the RPC request. Both in this and earlier releases 
ONTAP would cache the call in frame 4480695, but starting in 9.3 we then cache 
the checksum as part of that.
 
When the client sends the request in frame 4591136 it uses the same transaction 
ID (0x5e15f77a) and same port again. Here the problem is that we already hold a 
checksum in cache for the “same transaction”
 …

this seems to be happening after the client did not receive the response and 
re-transmits the request.
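
The replay-cache behaviour described in the NetApp reply can be modelled
roughly as below. This is only a sketch under stated assumptions: the class
name ReplayCache, the cache key (client address, source port, xid), and the
use of SHA-256 as the checksum are all stand-ins, and what ONTAP actually does
on a mismatch is not stated in the reply, so the model only labels the case.

```python
import hashlib
from typing import Dict, Tuple

Key = Tuple[str, int, int]  # (client address, source port, xid)


class ReplayCache:
    """Toy duplicate-request cache that also remembers a request checksum."""

    def __init__(self) -> None:
        self._seen: Dict[Key, bytes] = {}

    def classify(self, addr: str, port: int, xid: int, request: bytes) -> str:
        key = (addr, port, xid)
        digest = hashlib.sha256(request).digest()  # stand-in checksum
        cached = self._seen.get(key)
        if cached is None:
            self._seen[key] = digest
            return "new"
        if cached == digest:
            return "retransmission"
        # Same address, port and xid, but a different request body.
        return "checksum-mismatch"


cache = ReplayCache()
client = ("132.65.116.111", 886, 0x5E15F77A)
first = cache.classify(*client, b"UNLOCK FH:0x54b075a0 svid:13629")
again = cache.classify(*client, b"UNLOCK FH:0x54b075a0 svid:13629")
other = cache.classify(*client, b"UNLOCK FH:0xb14b75a8 svid:13629")
```

The first call is "new", an identical resend is a "retransmission", and the
later unlock of a different file handle under the same XID and port is the
"checksum-mismatch" case the reply is about.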

danny


> On 24 Dec 2019, at 5:02, Rick Macklem  wrote:
> 
> Richard P Mackerras wrote:
>> Hi,
>> 
>> We had some bully type workloads emerge when we moved a lot of block
>> storage from old XIV to new all flash 3PAR. I wonder if your IMAP issue
>> might have emerged just because suddenly there was the opportunity with all
>> flash. QOS is good on 9.x ONTAP. If anyone says it’s not then they last
>> looked on 8.x. So I suggest you QOS the IMAP workload.
>> 
>> Nobody should be using UDP with NFS unless they have a very specific set
>> of circumstances. TCP was a real step forward.
> Well, I can't argue with this, considering I did the first working
> implementation of NFS over TCP. It was actually Mike Karels that suggested I
> try doing so. There's a paper in a very old Usenix Conference Proceedings,
> but it is so old that it isn't on the Usenix web page (around 1988 in Denver,
> if I recall). I don't even have a copy myself, although I was the author.
> 
> Now, having said that, I must note that the Network Lock Manager (NLM) and
> Network Status Monitor (NSM) were not NFS. They were separate stateful
> protocols (poorly designed imho) that Sun never published.
> 
> NFS as Sun designed it (NFSv2 and NFSv3) were "stateless server" protocols,
> so that they could work reliably without server crash recovery.
> However, the NLM was inherently stateful, since it was dealing with file 
> locks.
> 
> So, you can't really lump the NLM with NFS (and you should avoid use of the
> NLM over any transport imho).
> 
> NFSv4 tackled the difficult problem of having a "stateful server" and crash
> recovery, which resulted in a much more complex protocol (compare the size of
> RFC-1813 vs RFC-5661 to get some idea of this).
> 
> rick
> 
> Cheers
> 
> Richard

Re: nfs lockd errors after NetApp software upgrade.

2019-12-23 Thread Rick Macklem
Richard P Mackerras wrote:
>Hi,
>
>We had some bully type workloads emerge when we moved a lot of block
>storage from old XIV to new all flash 3PAR. I wonder if your IMAP issue
>might have emerged just because suddenly there was the opportunity with all
>flash. QOS is good on 9.x ONTAP. If anyone says it’s not then they last
>looked on 8.x. So I suggest you QOS the IMAP workload.
>
> Nobody should be using UDP with NFS unless they have a very specific set
>of circumstances. TCP was a real step forward.
Well, I can't argue with this, considering I did the first working
implementation of NFS over TCP. It was actually Mike Karels that suggested I
try doing so. There's a paper in a very old Usenix Conference Proceedings,
but it is so old that it isn't on the Usenix web page (around 1988 in Denver,
if I recall). I don't even have a copy myself, although I was the author.

Now, having said that, I must note that the Network Lock Manager (NLM) and
Network Status Monitor (NSM) were not NFS. They were separate stateful
protocols (poorly designed imho) that Sun never published.

NFS as Sun designed it (NFSv2 and NFSv3) were "stateless server" protocols,
so that they could work reliably without server crash recovery.
However, the NLM was inherently stateful, since it was dealing with file locks.

So, you can't really lump the NLM with NFS (and you should avoid use of the
NLM over any transport imho).

NFSv4 tackled the difficult problem of having a "stateful server" and crash
recovery, which resulted in a much more complex protocol (compare the size of
RFC-1813 vs RFC-5661 to get some idea of this).

rick

Cheers

Richard


Re: nfs lockd errors after NetApp software upgrade.

2019-12-23 Thread Richard P Mackerras
Hi,

We had some bully type workloads emerge when we moved a lot of block
storage from old XIV to new all flash 3PAR. I wonder if your IMAP issue
might have emerged just because suddenly there was the opportunity with all
flash. QOS is good on 9.x ONTAP. If anyone says it’s not then they last
looked on 8.x. So I suggest you QOS the IMAP workload.

 Nobody should be using UDP with NFS unless they have a very specific set
of circumstances. TCP was a real step forward.

Cheers

Richard


Re: nfs lockd errors after NetApp software upgrade.

2019-12-23 Thread Adam McDougall
On 12/22/19 12:01 PM, Rick Macklem wrote:

> Well, I've noted the flawed protocol. Here's an example (from my limited
> understanding of these protocols, where there has never been a published
> spec):
> - The NLM supports a "blocking lock request" that goes something like this...
>    - client requests lock and is willing to wait for it
>    - if server has a conflicting lock on the file, it replies "I'll acquire
>      the lock for you when I can and let you know".
>      --> When the conflicting lock is released, the server acquires the lock
>          and does a callback (server->client RPC) to tell the client it now
>          has the lock.
> You don't have to think about this for long to realize that any network
> unreliability or partitioning could result in trouble.
> The kernel RPC layer may do some retries of the RPCs (this is controlled by
> the parameters set for the RPC), but at some point the protocol asks the NSM
> (rpc.statd) if the machine is "up" and then uses the NSM's answer to deal
> with it.
> (The NSM basically pokes other systems and notes they are "up" if they get
>  replies to these pokes. It uses IP broadcast at some point.)
> 
> Now, maybe switching to TCP will make the RPCs reliable enough that it will
> work, or maybe it won't? (It certainly sounds like the Netapp upgrade is
> causing some kind of network issue, and the NLM doesn't tolerate that well.)
> 
> rick

tl;dr I think netapp effectively nerfed UDP lockd performance in newer
versions, maybe cluster mode.

From my very un-fun experience after migrating our volumes off an older
netapp onto a new netapp with flash drives (plenty fast) running Ontap
9.x ("cluster mode"), our typical IO load from idle time IMAP
connections was enough to overwhelm the new netapp and drive performance
into the ground. The same IO that was perfectly fine on the old netapp.
Going into a workday in this state was absolutely not possible. I opened
a high priority ticket with netapp, didn't really get anywhere that very
long day and settled on nolockd so I could go home and sleep. Both my
hunch later and netapp support suggested switching lockd traffic to TCP
even though I had no network problems (the old netapp was fine). I think
I still run into occasional load issues but the newer netapp OS seemed
way more capable of this load when using TCP lockd. Of course they also
suggested switching to nfsv4 but I could not seriously entertain
validating that type of change for production in less than a day.


Re: nfs lockd errors after NetApp software upgrade.

2019-12-22 Thread Rick Macklem
Daniel Braniss wrote:
>> On 21 Dec 2019, at 19:32, Rick Macklem  wrote:
>>
>> Daniel Braniss wrote:
>>>> On 20 Dec 2019, at 19:19, Rick Macklem <rmack...@uoguelph.ca> wrote:
>>>>
>>>> Adam McDougall wrote:
>>>>> Try changing bool_t do_tcp = FALSE; to TRUE in
>>>>> /usr/src/sys/nlm/nlm_prot_impl.c, recompile the kernel and try again. I
>>>>> think this makes it match Linux client behavior. I suspect I ran into
>>>>> the same issue as you. I do think I used nolockd as a workaround
>>>>> temporarily. I can provide some more details if it works.
>>>> If this fixes the problem, please let me know.
>>>>
>>>> I'm not sure I'd want to change the default, since it might break things
>>>> for others, but I can definitely make it a tunable, so that people don't
>>>> need to recompile a kernel to deal with it.
>>>>
>>> great! I was just about to see how it can be done (tunable) but need to
>>> check if it can be done at any time, or just at boot time.
>> I haven't looked at the code, but I suspect changing it on the fly could
>> cause problems, so I am inclined to make it a tunable (boot time only).
>my feelings too.
>>
>>> thanks.
>>> btw, currently, from several hours of analysing the traffic, it seems that
>>> nlm is UDP.
>> I assume that means you haven't tried flipping it to TCP yet.
>I will soon, but I have my doubts. The problem is caused by multiple events,
>i.e., it happened once while I was doing an svn checkout, but I have done it
>several times since with no issues. So it must be an aggregation of factors.
>Other hosts are reporting locks times too.
Well, I've noted the flawed protocol. Here's an example (from my limited
understanding of these protocols, where there has never been a published spec):
- The NLM supports a "blocking lock request" that goes something like this...
   - client requests lock and is willing to wait for it
   - if server has a conflicting lock on the file, it replies "I'll acquire
     the lock for you when I can and let you know".
     --> When the conflicting lock is released, the server acquires the lock
         and does a callback (server->client RPC) to tell the client it now
         has the lock.
You don't have to think about this for long to realize that any network
unreliability or partitioning could result in trouble.
The kernel RPC layer may do some retries of the RPCs (this is controlled by the
parameters set for the RPC), but at some point the protocol asks the NSM
(rpc.statd) if the machine is "up" and then uses the NSM's answer to deal with
it.
(The NSM basically pokes other systems and notes they are "up" if they get
 replies to these pokes. It uses IP broadcast at some point.)
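
The blocking-lock exchange described above can be sketched as a toy model.
This is not the NLM wire protocol: the class LockServer, the scenario()
helper and the client names are hypothetical, and retries and the NSM are
left out on purpose to show only what a lost callback does.

```python
from typing import Dict, List, Optional, Tuple


class LockServer:
    """Toy lock server: one holder per file plus a queue of blocked waiters."""

    def __init__(self) -> None:
        self.holder: Dict[str, str] = {}
        self.waiters: Dict[str, List[str]] = {}

    def lock(self, client: str, path: str) -> str:
        if path not in self.holder:
            self.holder[path] = client
            return "granted"
        # Conflicting lock: remember the waiter and promise a later callback.
        self.waiters.setdefault(path, []).append(client)
        return "blocked"

    def unlock(self, client: str, path: str) -> Optional[str]:
        """Release the lock; return the waiter that now holds it, if any."""
        if self.holder.get(path) != client:
            return None
        queue = self.waiters.get(path, [])
        if queue:
            self.holder[path] = queue.pop(0)
            return self.holder[path]
        del self.holder[path]
        return None


def scenario(callback_delivered: bool) -> Tuple[Optional[str], bool]:
    """Return (holder according to the server, whether client B knows it)."""
    server = LockServer()
    server.lock("A", "/f")              # granted
    server.lock("B", "/f")              # blocked, B waits for a callback
    grantee = server.unlock("A", "/f")  # server hands the lock to B
    b_knows = grantee == "B" and callback_delivered
    return server.holder.get("/f"), b_knows
```

scenario(True) ends with server and client agreeing that B holds the lock.
scenario(False) ends with the server recording B as the holder while B is
still waiting, which is the kind of trouble a lost server->client RPC causes.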

Now, maybe switching to TCP will make the RPCs reliable enough that it will
work, or maybe it won't? (It certainly sounds like the Netapp upgrade is causing
some kind of network issue, and the NLM doesn't tolerate that well.)

rick


Re: nfs lockd errors after NetApp software upgrade.

2019-12-21 Thread Daniel Braniss


> On 21 Dec 2019, at 19:32, Rick Macklem  wrote:
> 
> Daniel Braniss wrote:
>>> On 20 Dec 2019, at 19:19, Rick Macklem <rmack...@uoguelph.ca> wrote:
>>> 
>>> Adam McDougall wrote:
>>>> Try changing bool_t do_tcp = FALSE; to TRUE in
>>>> /usr/src/sys/nlm/nlm_prot_impl.c, recompile the kernel and try again. I
>>>> think this makes it match Linux client behavior. I suspect I ran into
>>>> the same issue as you. I do think I used nolockd as a workaround
>>>> temporarily. I can provide some more details if it works.
>>> If this fixes the problem, please let me know.
>>> 
>>> I'm not sure I'd want to change the default, since it might break things for
>>> others, but I can definitely make it a tunable, so that people don't need to
>>> recompile a kernel to deal with it.
>>> 
>> great! I was just about to see how it can be done (tunable) but need to
>> check if it can be done at any time, or just at boot time.
> I haven't looked at the code, but I suspect changing it on the fly could
> cause problems, so I am inclined to make it a tunable (boot time only).
my feelings too.
> 
>> thanks.
>> btw, currently, from several hours of analysing the traffic, it seems that
>> nlm is UDP.
> I assume that means you haven't tried flipping it to TCP yet.
I will soon, but I have my doubts. The problem is caused by multiple events,
i.e., it happened once while I was doing an svn checkout, but I have done it
several times since with no issues. So it must be an aggregation of factors.
Other hosts are reporting locks times too.

danny

> 
> Please let us know how it goes, rick
> 
> danny
> 
> 
> rick
> 
> On 12/19/19 9:21 AM, Daniel Braniss wrote:
> 
> 
> On 19 Dec 2019, at 16:09, Rick Macklem <rmack...@uoguelph.ca> wrote:
> 
> Daniel Braniss wrote:
> [stuff snipped]
> all mounts are nfsv3/tcp
> This doesn't affect what the NLM code (rpc.lockd) uses. I honestly don't know
> when the NLM uses tcp vs udp. I think rpc.statd still uses IP broadcast at
> times.
> can the replay cache have any influence here? I tend to remember way back
> issues with it,
> 
> To me, it looks like a network configuration issue.
> that was/is my gut feelings too, but, as far as we can tell, nothing has
> changed in the network infrastructure, the problems appeared after the
> NetApp’s software was updated, it was working fine till then.
> 
> the problems are also happening on freebsd 12.1
> 
> You could capture packets (maybe when a client first starts rpc.statd and 
> rpc.lockd)
> and then look at them in wireshark. I'd disable startup of rpc.lockd and 
> rpc.statd
> at boot for a test client and then run something like:
> # tcpdump -s 0 -w out.pcap host 
> - and then start rpc.statd and rpc.lockd
> Then I'd look at out.pcap in wireshark (much better at decoding this stuff 
> than
> tcpdump). I'd look for things like different reply IP addresses from the 
> Netapp,
> which might confuse this tired old NLM protocol Sun devised in the mid-1980s.
> 
> it’s going to be an interesting week end :-(
> 
> the error is also appearing on freebsd-11.2-stable, I’m now checking if it’s 
> also
> happening on 12.1
> btw, the NetApp version is 9.3P17
> Yes. I wasn't the author of the NSM and NLM code (long ago I refused to even
> try to implement it, because I knew the protocol was badly broken) and I avoid
> fiddling with it. As such, it won't have changed much since around FreeBSD 7.
> and we haven’t had any issues with it for years, so you must have done 
> something good
> 
> cheers,
> danny
> 
> 
> rick
> 
> cheers,
>  danny
> 
> rick
> 
> Cheers
> 
> Richard
> (NetApp admin)
> 
> On Wed, 18 Dec 2019 at 15:46, Daniel Braniss wrote:
> 
> 
> On 18 Dec 2019, at 16:55, Rick Macklem wrote:
> 
> Daniel Braniss wrote:
> 
> Hi,
> The server with the problems is running FreeBSD 11.1 stable, it was working 
> fine for several months,
> but after a software upgrade of our NetAPP server it’s reporting many lockd 
> errors and becomes catatonic,
> ...
> Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not responding
> Dec 18 13:11:45 moo-09 last message repeated 7 times
> Dec 18 13:12:55 moo-09 last message repeated 8 times
> Dec 18 13:13:10 moo-09 kernel: nfs server fr-06:/web/www: lockd is alive again
> Dec 18 13:13:10 moo-09 last message repeated 8 times
> Dec 18 13:13:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen 
> queue overflow: 194 already in queue awaiting acceptance (1 occurrences)
> Dec 18 13:14:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen 
> queue overflow: 193 already in queue awaiting acceptance (3957 occurrences)
> Dec 18 13:15:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen 
> queue overflow: 193 already in queue awaiting acceptance …
> Seems like their software upgrade didn't improve handling of NLM RPCs?
> Appears to be 

Re: nfs lockd errors after NetApp software upgrade.

2019-12-21 Thread Rick Macklem
Daniel Braniss wrote:
>>On 20 Dec 2019, at 19:19, Rick Macklem wrote:
>>
>>Adam McDougall wrote:
>>>Try changing bool_t do_tcp = FALSE; to TRUE in
>>>/usr/src/sys/nlm/nlm_prot_impl.c, recompile the kernel and try again. I
>>>think this makes it match Linux client behavior. I suspect I ran into
>>>the same issue as you. I do think I used nolockd as a workaround
>>>temporarily. I can provide some more details if it works.
>>If this fixes the problem, please let me know.
>>
>>I'm not sure I'd want to change the default, since it might break things for
>>others, but I can definitely make it a tunable, so that people don't need to
>>recompile a kernel to deal with it.
>>
>>
>great! I was just about to see how it can be done (tunable) but need to check 
>if it can be done
>at any time, or just at boot time.
I haven't looked at the code, but I suspect changing it on the fly could cause 
problems,
so I am inclined to make it a tunable (boot time only).

>thanks.
>btw, currently, from several hours of analysing the traffic, it seems that nlm 
>is UDP.
I assume that means you haven't tried flipping it to TCP yet.

Please let us know how it goes, rick

danny


rick

On 12/19/19 9:21 AM, Daniel Braniss wrote:


On 19 Dec 2019, at 16:09, Rick Macklem wrote:

Daniel Braniss wrote:
[stuff snipped]
all mounts are nfsv3/tcp
This doesn't affect what the NLM code (rpc.lockd) uses. I honestly don't know 
when
the NLM uses tcp vs udp. I think rpc.statd still uses IP broadcast at times.
can the replay cache have any influence here? I tend to remember way back issues
with it,

To me, it looks like a network configuration issue.
that was/is my gut feelings too, but, as far as we can tell, nothing has 
changed in the network infrastructure,
the problems appeared after the NetAPP’s software was updated, it was working 
fine till then.

the problems are also happening on freebsd 12.1

You could capture packets (maybe when a client first starts rpc.statd and 
rpc.lockd)
and then look at them in wireshark. I'd disable startup of rpc.lockd and 
rpc.statd
at boot for a test client and then run something like:
# tcpdump -s 0 -w out.pcap host 
- and then start rpc.statd and rpc.lockd
Then I'd look at out.pcap in wireshark (much better at decoding this stuff than
tcpdump). I'd look for things like different reply IP addresses from the Netapp,
which might confuse this tired old NLM protocol Sun devised in the mid-1980s.

it’s going to be an interesting week end :-(

the error is also appearing on freebsd-11.2-stable, I’m now checking if it’s 
also
happening on 12.1
btw, the NetApp version is 9.3P17
Yes. I wasn't the author of the NSM and NLM code (long ago I refused to even
try to implement it, because I knew the protocol was badly broken) and I avoid
fiddling with it. As such, it won't have changed much since around FreeBSD 7.
and we haven’t had any issues with it for years, so you must have done 
something good

cheers,
 danny


rick

cheers,
  danny

rick

Cheers

Richard
(NetApp admin)

On Wed, 18 Dec 2019 at 15:46, Daniel Braniss wrote:


On 18 Dec 2019, at 16:55, Rick Macklem wrote:

Daniel Braniss wrote:

Hi,
The server with the problems is running FreeBSD 11.1 stable, it was working 
fine for several months,
but after a software upgrade of our NetAPP server it’s reporting many lockd 
errors and becomes catatonic,
...
Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not responding
Dec 18 13:11:45 moo-09 last message repeated 7 times
Dec 18 13:12:55 moo-09 last message repeated 8 times
Dec 18 13:13:10 moo-09 kernel: nfs server fr-06:/web/www: lockd is alive again
Dec 18 13:13:10 moo-09 last message repeated 8 times
Dec 18 13:13:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen queue 
overflow: 194 already in queue awaiting acceptance (1 occurrences)
Dec 18 13:14:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen queue 
overflow: 193 already in queue awaiting acceptance (3957 occurrences)
Dec 18 13:15:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen queue 
overflow: 193 already in queue awaiting acceptance …
Seems like their software upgrade didn't improve handling of NLM RPCs?
Appears to be handling RPCs slowly and/or intermittently. Note that no one
tests it with IPv6, so at least make sure you are still using IPv4 for the 
mounts and
try and make sure IP broadcast works between client and Netapp. I think the NLM
and NSM (rpc.statd) still use IP broadcast sometimes.

we are ipv4 - we have our own class c :-)
Maybe the network guys can suggest more w.r.t. why, but as I've stated before,
the NLM is a fundamentally broken protocol which was never published by Sun,
so I suggest you avoid using it if at all possible.
well, at the moment the ball is on NetAPP court, and switching to NFSv4 at the 
moment 

Re: nfs lockd errors after NetApp software upgrade.

2019-12-20 Thread Daniel Braniss


> On 20 Dec 2019, at 19:19, Rick Macklem  wrote:
> 
> Adam McDougall wrote:
>> Try changing bool_t do_tcp = FALSE; to TRUE in
>> /usr/src/sys/nlm/nlm_prot_impl.c, recompile the kernel and try again. I
>> think this makes it match Linux client behavior. I suspect I ran into
>> the same issue as you. I do think I used nolockd as a workaround
>> temporarily. I can provide some more details if it works.
> If this fixes the problem, please let me know.
> 
> I'm not sure I'd want to change the default, since it might break things for
> others, but I can definitely make it a tunable, so that people don't need to
> recompile a kernel to deal with it.
> 

great! I was just about to see how it can be done (as a tunable), but I need to check if 
it can be done
at any time, or just at boot time.
thanks.
btw, currently, from several hours of analysing the traffic, it seems that NLM 
is using UDP.
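
One way to make that observation repeatable is to count the decoded NLM packets per transport. The following is a minimal Python sketch, assuming the capture has already been exported (for example from wireshark) into dicts; the keys rpc_program and transport and the function name nlm_transport_counts are hypothetical. 100021 is the RPC program number registered for the NLM, 100003 the one for NFS.

```python
from collections import Counter

NLM_PROGRAM = 100021  # ONC RPC program number of the Network Lock Manager


def nlm_transport_counts(packets):
    """Count NLM packets per transport ('udp' or 'tcp')."""
    counts = Counter()
    for pkt in packets:
        if pkt["rpc_program"] == NLM_PROGRAM:
            counts[pkt["transport"]] += 1
    return counts


# Hand-written sample records (an assumption, not real capture data).
sample = [
    {"rpc_program": 100021, "transport": "udp"},
    {"rpc_program": 100021, "transport": "udp"},
    {"rpc_program": 100003, "transport": "tcp"},  # NFS itself, ignored here
]
print(nlm_transport_counts(sample))
```

If the do_tcp change takes effect, the same count taken on a new capture should move from udp to tcp.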
danny


> rick
> 
> On 12/19/19 9:21 AM, Daniel Braniss wrote:
>> 
>> 
>>> On 19 Dec 2019, at 16:09, Rick Macklem  wrote:
>>> 
>>> Daniel Braniss wrote:
>>> [stuff snipped]
 all mounts are nfsv3/tcp
>>> This doesn't affect what the NLM code (rpc.lockd) uses. I honestly don't 
>>> know when
>>> the NLM uses tcp vs udp. I think rpc.statd still uses IP broadcast at times.
>> can the replay cache have any influence here? I tend to remember way back 
>> issues
>> with it,
>>> 
>>> To me, it looks like a network configuration issue.
>> that was/is my gut feelings too, but, as far as we can tell, nothing has 
>> changed in the network infrastructure,
>> the problems appeared after the NetAPP’s software was updated, it was 
>> working fine till then.
>> 
>> the problems are also happening on freebsd 12.1
>> 
>>> You could capture packets (maybe when a client first starts rpc.statd and 
>>> rpc.lockd)
>>> and then look at them in wireshark. I'd disable startup of rpc.lockd and 
>>> rpc.statd
>>> at boot for a test client and then run something like:
>>> # tcpdump -s 0 -w out.pcap host 
>>> - and then start rpc.statd and rpc.lockd
>>> Then I'd look at out.pcap in wireshark (much better at decoding this stuff 
>>> than
>>> tcpdump). I'd look for things like different reply IP addresses from the 
>>> Netapp,
>>> which might confuse this tired old NLM protocol Sun devised in the 
>>> mid-1980s.
>>> 
>> it’s going to be an interesting week end :-(
>> 
 the error is also appearing on freebsd-11.2-stable, I’m now checking if 
 it’s also
 happening on 12.1
 btw, the NetApp version is 9.3P17
>>> Yes. I wasn't the author of the NSM and NLM code (long ago I refused to even
>>> try to implement it, because I knew the protocol was badly broken) and I 
>>> avoid
>>> fiddling with it. As such, it won't have changed much since around FreeBSD 7.
>> and we haven’t had any issues with it for years, so you must have done 
>> something good
>> 
>> cheers,
>>  danny
>> 
>>> 
>>> rick
>>> 
>>> cheers,
>>>   danny
>>> 
 rick
 
 Cheers
 
 Richard
 (NetApp admin)
 
 On Wed, 18 Dec 2019 at 15:46, Daniel Braniss wrote:
 
 
> On 18 Dec 2019, at 16:55, Rick Macklem wrote:
> 
> Daniel Braniss wrote:
> 
>> Hi,
>> The server with the problems is running FreeBSD 11.1 stable, it was 
>> working fine for several months,
>> but after a software upgrade of our NetAPP server it’s reporting many 
>> lockd errors and becomes catatonic,
>> ...
>> Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not 
>> responding
>> Dec 18 13:11:45 moo-09 last message repeated 7 times
>> Dec 18 13:12:55 moo-09 last message repeated 8 times
>> Dec 18 13:13:10 moo-09 kernel: nfs server fr-06:/web/www: lockd is alive 
>> again
>> Dec 18 13:13:10 moo-09 last message repeated 8 times
>> Dec 18 13:13:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen 
>> queue overflow: 194 already in queue awaiting acceptance (1 occurrences)
>> Dec 18 13:14:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen 
>> queue overflow: 193 already in queue awaiting acceptance (3957 
>> occurrences)
>> Dec 18 13:15:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen 
>> queue overflow: 193 already in queue awaiting acceptance …
> Seems like their software upgrade didn't improve handling of NLM RPCs?
> Appears to be handling RPCs slowly and/or intermittently. Note that no one
> tests it with IPv6, so at least make sure you are still using IPv4 for 
> the mounts and
> try and make sure IP broadcast works between client and Netapp. I think 
> the NLM
> and NSM (rpc.statd) still use IP broadcast sometimes.
> 
 we are ipv4 - we have our own class c :-)
> Maybe the network guys can suggest more w.r.t. why, but as I've stated 
> before,
> the NLM is a fundamentally broken protocol which was never published by 
> Sun,

Re: nfs lockd errors after NetApp software upgrade.

2019-12-20 Thread Rick Macklem
Adam McDougall wrote:
>Try changing bool_t do_tcp = FALSE; to TRUE in
>/usr/src/sys/nlm/nlm_prot_impl.c, recompile the kernel and try again. I
>think this makes it match Linux client behavior. I suspect I ran into
>the same issue as you. I do think I used nolockd as a workaround
>temporarily. I can provide some more details if it works.
If this fixes the problem, please let me know.

I'm not sure I'd want to change the default, since it might break things for
others, but I can definitely make it a tunable, so that people don't need to
recompile a kernel to deal with it.

rick

On 12/19/19 9:21 AM, Daniel Braniss wrote:
>
>
>> On 19 Dec 2019, at 16:09, Rick Macklem  wrote:
>>
>> Daniel Braniss wrote:
>> [stuff snipped]
>>> all mounts are nfsv3/tcp
>> This doesn't affect what the NLM code (rpc.lockd) uses. I honestly don't 
>> know when
>> the NLM uses tcp vs udp. I think rpc.statd still uses IP broadcast at times.
> can the replay cache have any influence here? I tend to remember way back 
> issues
> with it,
>>
>> To me, it looks like a network configuration issue.
> that was/is my gut feelings too, but, as far as we can tell, nothing has 
> changed in the network infrastructure,
> the problems appeared after the NetAPP’s software was updated, it was working 
> fine till then.
>
> the problems are also happening on freebsd 12.1
>
>> You could capture packets (maybe when a client first starts rpc.statd and 
>> rpc.lockd)
>> and then look at them in wireshark. I'd disable startup of rpc.lockd and 
>> rpc.statd
>> at boot for a test client and then run something like:
>> # tcpdump -s 0 -w out.pcap host 
>> - and then start rpc.statd and rpc.lockd
>> Then I'd look at out.pcap in wireshark (much better at decoding this stuff 
>> than
>> tcpdump). I'd look for things like different reply IP addresses from the 
>> Netapp,
>> which might confuse this tired old NLM protocol Sun devised in the mid-1980s.
>>
> it’s going to be an interesting week end :-(
>
>>> the error is also appearing on freebsd-11.2-stable, I’m now checking if 
>>> it’s also
>>> happening on 12.1
>>> btw, the NetApp version is 9.3P17
>> Yes. I wasn't the author of the NSM and NLM code (long ago I refused to even
>> try to implement it, because I knew the protocol was badly broken) and I 
>> avoid
>> fiddling with it. As such, it won't have changed much since around FreeBSD 7.
> and we haven’t had any issues with it for years, so you must have done 
> something good
>
> cheers,
>   danny
>
>>
>> rick
>>
>> cheers,
>>danny
>>
>>> rick
>>>
>>> Cheers
>>>
>>> Richard
>>> (NetApp admin)
>>>
>>> On Wed, 18 Dec 2019 at 15:46, Daniel Braniss wrote:
>>>
>>>
 On 18 Dec 2019, at 16:55, Rick Macklem wrote:

 Daniel Braniss wrote:

> Hi,
> The server with the problems is running FreeBSD 11.1 stable, it was 
> working fine for several months,
> but after a software upgrade of our NetAPP server it’s reporting many 
> lockd errors and becomes catatonic,
> ...
> Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not 
> responding
> Dec 18 13:11:45 moo-09 last message repeated 7 times
> Dec 18 13:12:55 moo-09 last message repeated 8 times
> Dec 18 13:13:10 moo-09 kernel: nfs server fr-06:/web/www: lockd is alive 
> again
> Dec 18 13:13:10 moo-09 last message repeated 8 times
> Dec 18 13:13:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen 
> queue overflow: 194 already in queue awaiting acceptance (1 occurrences)
> Dec 18 13:14:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen 
> queue overflow: 193 already in queue awaiting acceptance (3957 
> occurrences)
> Dec 18 13:15:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen 
> queue overflow: 193 already in queue awaiting acceptance …
 Seems like their software upgrade didn't improve handling of NLM RPCs?
 Appears to be handling RPCs slowly and/or intermittently. Note that no one
 tests it with IPv6, so at least make sure you are still using IPv4 for the 
 mounts and
 try and make sure IP broadcast works between client and Netapp. I think 
 the NLM
 and NSM (rpc.statd) still use IP broadcast sometimes.

>>> we are ipv4 - we have our own class c :-)
 Maybe the network guys can suggest more w.r.t. why, but as I've stated 
 before,
 the NLM is a fundamentally broken protocol which was never published by 
 Sun,
 so I suggest you avoid using it if at all possible.
>>> well, at the moment the ball is on NetAPP court, and switching to NFSv4 at 
>>> the moment is out of the question, it’s
>>> a production server used by several thousand students.
>>>

 - If the locks don't need to be seen by other clients, you can just use 
 the "nolockd"
 mount option.
 or
 - If locks need to be seen by other clients, try NFSv4 mounts. Netapp 
 

Re: nfs lockd errors after NetApp software upgrade.

2019-12-19 Thread Adam McDougall
Try changing bool_t do_tcp = FALSE; to TRUE in
/usr/src/sys/nlm/nlm_prot_impl.c, recompile the kernel and try again. I
think this makes it match Linux client behavior. I suspect I ran into
the same issue as you. I do think I used nolockd as a workaround
temporarily. I can provide some more details if it works.

On 12/19/19 9:21 AM, Daniel Braniss wrote:
> 
> 
>> On 19 Dec 2019, at 16:09, Rick Macklem  wrote:
>>
>> Daniel Braniss wrote:
>> [stuff snipped]
>>> all mounts are nfsv3/tcp
>> This doesn't affect what the NLM code (rpc.lockd) uses. I honestly don't 
>> know when
>> the NLM uses tcp vs udp. I think rpc.statd still uses IP broadcast at times.
> can the replay cache have any influence here? I tend to remember way back 
> issues
> with it,
>>
>> To me, it looks like a network configuration issue.
> that was/is my gut feelings too, but, as far as we can tell, nothing has 
> changed in the network infrastructure,
> the problems appeared after the NetAPP’s software was updated, it was working 
> fine till then.
> 
> the problems are also happening on freebsd 12.1
> 
>> You could capture packets (maybe when a client first starts rpc.statd and 
>> rpc.lockd)
>> and then look at them in wireshark. I'd disable startup of rpc.lockd and 
>> rpc.statd
>> at boot for a test client and then run something like:
>> # tcpdump -s 0 -w out.pcap host 
>> - and then start rpc.statd and rpc.lockd
>> Then I'd look at out.pcap in wireshark (much better at decoding this stuff 
>> than
>> tcpdump). I'd look for things like different reply IP addresses from the 
>> Netapp,
>> which might confuse this tired old NLM protocol Sun devised in the mid-1980s.
>>
> it’s going to be an interesting week end :-(
>  
>>> the error is also appearing on freebsd-11.2-stable, I’m now checking if 
>>> it’s also
>>> happening on 12.1
>>> btw, the NetApp version is 9.3P17
>> Yes. I wasn't the author of the NSM and NLM code (long ago I refused to even
>> try to implement it, because I knew the protocol was badly broken) and I 
>> avoid
>> fiddling with it. As such, it won't have changed much since around FreeBSD 7.
> and we haven’t had any issues with it for years, so you must have done 
> something good
> 
> cheers,
>   danny
> 
>>
>> rick
>>
>> cheers,
>>danny
>>
>>> rick
>>>
>>> Cheers
>>>
>>> Richard
>>> (NetApp admin)
>>>
>>> On Wed, 18 Dec 2019 at 15:46, Daniel Braniss wrote:
>>>
>>>
 On 18 Dec 2019, at 16:55, Rick Macklem wrote:

 Daniel Braniss wrote:

> Hi,
> The server with the problems is running FreeBSD 11.1 stable, it was 
> working fine for several months,
> but after a software upgrade of our NetAPP server it’s reporting many 
> lockd errors and becomes catatonic,
> ...
> Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not 
> responding
> Dec 18 13:11:45 moo-09 last message repeated 7 times
> Dec 18 13:12:55 moo-09 last message repeated 8 times
> Dec 18 13:13:10 moo-09 kernel: nfs server fr-06:/web/www: lockd is alive 
> again
> Dec 18 13:13:10 moo-09 last message repeated 8 times
> Dec 18 13:13:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen 
> queue overflow: 194 already in queue awaiting acceptance (1 occurrences)
> Dec 18 13:14:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen 
> queue overflow: 193 already in queue awaiting acceptance (3957 
> occurrences)
> Dec 18 13:15:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen 
> queue overflow: 193 already in queue awaiting acceptance …
 Seems like their software upgrade didn't improve handling of NLM RPCs?
 Appears to be handling RPCs slowly and/or intermittently. Note that no one
 tests it with IPv6, so at least make sure you are still using IPv4 for the 
 mounts and
 try and make sure IP broadcast works between client and Netapp. I think 
 the NLM
 and NSM (rpc.statd) still use IP broadcast sometimes.

>>> we are ipv4 - we have our own class c :-)
 Maybe the network guys can suggest more w.r.t. why, but as I've stated 
 before,
 the NLM is a fundamentally broken protocol which was never published by 
 Sun,
 so I suggest you avoid using it if at all possible.
>>> well, at the moment the ball is on NetAPP court, and switching to NFSv4 at 
>>> the moment is out of the question, it’s
>>> a production server used by several thousand students.
>>>

 - If the locks don't need to be seen by other clients, you can just use 
 the "nolockd"
 mount option.
 or
 - If locks need to be seen by other clients, try NFSv4 mounts. Netapp 
 filers
 should support NFSv4.1, which is a much better protocol than NFSv4.0.

 Good luck with it, rick
>>> thanks
>>>   danny
>>>
 …
 any ideas?

 thanks,
  danny


Re: nfs lockd errors after NetApp software upgrade.

2019-12-19 Thread Richard P Mackerras
Hi,
At ONTAP 9.3P6 there is a possible LACP group issue after upgrade. Have you
checked any LACP groups?
These should not be a problem, but I assume the network interfaces are at their
home ports, not on slower ports or something silly. It is marginally better
if the traffic goes direct to the node where the volume is, but the
difference should be nothing. Have you looked at the NetApp performance data?

If you are going to do wireshark tcpdumps then you might want to run them
from the NetApp as well.

https://kb.netapp.com/app/answers/answer_view/a_id/1029833/~/how-to-capture-packet-traces-%28tcpdump%29-on-ontap-9.2%2B-systems-



::> network tcpdump start -node  -port e0a -buffer-size 2097151

Let us know how you go,

Richard


Re: nfs lockd errors after NetApp software upgrade.

2019-12-19 Thread Daniel Braniss


> On 19 Dec 2019, at 16:09, Rick Macklem  wrote:
> 
> Daniel Braniss wrote:
> [stuff snipped]
>> all mounts are nfsv3/tcp
> This doesn't affect what the NLM code (rpc.lockd) uses. I honestly don't know 
> when
> the NLM uses tcp vs udp. I think rpc.statd still uses IP broadcast at times.
Can the replay cache have any influence here? I seem to remember issues with it
from way back.
> 
> To me, it looks like a network configuration issue.
That was/is my gut feeling too but, as far as we can tell, nothing has 
changed in the network infrastructure.
The problems appeared after the NetAPP’s software was updated; it was working 
fine till then.

the problems are also happening on freebsd 12.1

> You could capture packets (maybe when a client first starts rpc.statd and 
> rpc.lockd)
> and then look at them in wireshark. I'd disable startup of rpc.lockd and 
> rpc.statd
> at boot for a test client and then run something like:
> # tcpdump -s 0 -w out.pcap host 
> - and then start rpc.statd and rpc.lockd
> Then I'd look at out.pcap in wireshark (much better at decoding this stuff 
> than
> tcpdump). I'd look for things like different reply IP addresses from the 
> Netapp,
> which might confuse this tired old NLM protocol Sun devised in the mid-1980s.
> 
it’s going to be an interesting week end :-(
 
>> the error is also appearing on freebsd-11.2-stable, I’m now checking if it’s 
>> also
>> happening on 12.1
>> btw, the NetApp version is 9.3P17
> Yes. I wasn't the author of the NSM and NLM code (long ago I refused to even
> try to implement it, because I knew the protocol was badly broken) and I avoid
> fiddling with it. As such, it won't have changed much since around FreeBSD 7.
and we haven’t had any issues with it for years, so you must have done 
something good

cheers,
danny

> 
> rick
> 
> cheers,
>danny
> 
>> rick
>> 
>> Cheers
>> 
>> Richard
>> (NetApp admin)
>> 
>> On Wed, 18 Dec 2019 at 15:46, Daniel Braniss wrote:
>> 
>> 
>>> On 18 Dec 2019, at 16:55, Rick Macklem wrote:
>>> 
>>> Daniel Braniss wrote:
>>> 
 Hi,
 The server with the problems is running FreeBSD 11.1 stable, it was 
 working fine for several months,
 but after a software upgrade of our NetAPP server it’s reporting many 
 lockd errors and becomes catatonic,
 ...
 Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not 
 responding
 Dec 18 13:11:45 moo-09 last message repeated 7 times
 Dec 18 13:12:55 moo-09 last message repeated 8 times
 Dec 18 13:13:10 moo-09 kernel: nfs server fr-06:/web/www: lockd is alive 
 again
 Dec 18 13:13:10 moo-09 last message repeated 8 times
 Dec 18 13:13:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen 
 queue overflow: 194 already in queue awaiting acceptance (1 occurrences)
 Dec 18 13:14:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen 
 queue overflow: 193 already in queue awaiting acceptance (3957 
 occurrences)
 Dec 18 13:15:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen 
 queue overflow: 193 already in queue awaiting acceptance …
>>> Seems like their software upgrade didn't improve handling of NLM RPCs?
>>> Appears to be handling RPCs slowly and/or intermittently. Note that no one
>>> tests it with IPv6, so at least make sure you are still using IPv4 for the 
>>> mounts and
>>> try and make sure IP broadcast works between client and Netapp. I think the 
>>> NLM
>>> and NSM (rpc.statd) still use IP broadcast sometimes.
>>> 
>> we are ipv4 - we have our own class c :-)
>>> Maybe the network guys can suggest more w.r.t. why, but as I've stated 
>>> before,
>>> the NLM is a fundamentally broken protocol which was never published by Sun,
>>> so I suggest you avoid using it if at all possible.
>> well, at the moment the ball is on NetAPP court, and switching to NFSv4 at 
>> the moment is out of the question, it’s
>> a production server used by several thousand students.
>> 
>>> 
>>> - If the locks don't need to be seen by other clients, you can just use the 
>>> "nolockd"
>>> mount option.
>>> or
>>> - If locks need to be seen by other clients, try NFSv4 mounts. Netapp filers
>>> should support NFSv4.1, which is a much better protocol than NFSv4.0.
>>> 
>>> Good luck with it, rick
>> thanks
>>   danny
>> 
>>> …
>>> any ideas?
>>> 
>>> thanks,
>>>  danny
>>> 

Re: nfs lockd errors after NetApp software upgrade.

2019-12-19 Thread Rick Macklem
Daniel Braniss wrote:
[stuff snipped]
>all mounts are nfsv3/tcp
This doesn't affect what the NLM code (rpc.lockd) uses. I honestly don't know 
when
the NLM uses tcp vs udp. I think rpc.statd still uses IP broadcast at times.

To me, it looks like a network configuration issue.
You could capture packets (maybe when a client first starts rpc.statd and 
rpc.lockd)
and then look at them in wireshark. I'd disable startup of rpc.lockd and 
rpc.statd
at boot for a test client and then run something like:
# tcpdump -s 0 -w out.pcap host 
- and then start rpc.statd and rpc.lockd
Then I'd look at out.pcap in wireshark (much better at decoding this stuff than
tcpdump). I'd look for things like different reply IP addresses from the Netapp,
which might confuse this tired old NLM protocol Sun devised in the mid-1980s.
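
The check for replies arriving from a different address can be scripted once the capture is decoded. What follows is a minimal Python sketch, not a finished tool: it assumes the RPC packets have been exported from wireshark into dicts with the hypothetical keys type, src, dst and xid, and the function name is made up as well. The addresses in the sample come from the documentation range.

```python
def find_mismatched_replies(packets):
    """Return (xid, expected_server, actual_source) for every reply whose
    source address differs from the address the matching call was sent to."""
    calls = {}  # (client address, xid) -> server address the call went to
    mismatches = []
    for pkt in packets:
        if pkt["type"] == "call":
            calls[(pkt["src"], pkt["xid"])] = pkt["dst"]
        elif pkt["type"] == "reply":
            expected = calls.get((pkt["dst"], pkt["xid"]))
            if expected is not None and expected != pkt["src"]:
                mismatches.append((pkt["xid"], expected, pkt["src"]))
    return mismatches


# Hand-written sample records (an assumption, not real capture data).
sample = [
    {"type": "call", "src": "192.0.2.1", "dst": "192.0.2.10", "xid": 1},
    {"type": "reply", "src": "192.0.2.10", "dst": "192.0.2.1", "xid": 1},
    {"type": "call", "src": "192.0.2.1", "dst": "192.0.2.10", "xid": 2},
    {"type": "reply", "src": "192.0.2.11", "dst": "192.0.2.1", "xid": 2},
]
for xid, expected, actual in find_mismatched_replies(sample):
    print(f"xid {xid}: call sent to {expected}, reply came from {actual}")
```

An empty result would suggest the replies come back from the address the calls were sent to, at least for the packets in the capture.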

>the error is also appearing on freebsd-11.2-stable, I’m now checking if it’s 
>also
>happening on 12.1
>btw, the NetApp version is 9.3P17
Yes. I wasn't the author of the NSM and NLM code (long ago I refused to even
try to implement it, because I knew the protocol was badly broken) and I avoid
fiddling with it. As such, it won't have changed much since around FreeBSD 7.

rick

cheers,
danny

> rick
>
> Cheers
>
> Richard
> (NetApp admin)
>
> On Wed, 18 Dec 2019 at 15:46, Daniel Braniss wrote:
>
>
>> On 18 Dec 2019, at 16:55, Rick Macklem wrote:
>>
>> Daniel Braniss wrote:
>>
>>> Hi,
>>> The server with the problems is running FreeBSD 11.1 stable, it was working 
>>> fine for several months,
>>> but after a software upgrade of our NetAPP server it’s reporting many lockd 
>>> errors and becomes catatonic,
>>> ...
>>> Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not 
>>> responding
>>> Dec 18 13:11:45 moo-09 last message repeated 7 times
>>> Dec 18 13:12:55 moo-09 last message repeated 8 times
>>> Dec 18 13:13:10 moo-09 kernel: nfs server fr-06:/web/www: lockd is alive 
>>> again
>>> Dec 18 13:13:10 moo-09 last message repeated 8 times
>>> Dec 18 13:13:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen 
>>> queue overflow: 194 already in queue awaiting acceptance (1 occurrences)
>>> Dec 18 13:14:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen 
>>> queue overflow: 193 already in queue awaiting acceptance (3957 occurrences)
>>> Dec 18 13:15:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen 
>>> queue overflow: 193 already in queue awaiting acceptance …
>> Seems like their software upgrade didn't improve handling of NLM RPCs?
>> Appears to be handling RPCs slowly and/or intermittently. Note that no one
>> tests it with IPv6, so at least make sure you are still using IPv4 for the 
>> mounts and
>> try and make sure IP broadcast works between client and Netapp. I think the 
>> NLM
>> and NSM (rpc.statd) still use IP broadcast sometimes.
>>
> we are ipv4 - we have our own class c :-)
>> Maybe the network guys can suggest more w.r.t. why, but as I've stated 
>> before,
>> the NLM is a fundamentally broken protocol which was never published by Sun,
>> so I suggest you avoid using it if at all possible.
> well, at the moment the ball is on NetAPP court, and switching to NFSv4 at 
> the moment is out of the question, it’s
> a production server used by several thousand students.
>
>>
>> - If the locks don't need to be seen by other clients, you can just use the 
>> "nolockd"
>>  mount option.
>> or
>> - If locks need to be seen by other clients, try NFSv4 mounts. Netapp filers
>>  should support NFSv4.1, which is a much better protocol than NFSv4.0.
>>
>> Good luck with it, rick
> thanks
>danny
>
>> …
>> any ideas?
>>
>> thanks,
>>   danny
>>


Re: nfs lockd errors after NetApp software upgrade.

2019-12-19 Thread Daniel Braniss


> On 19 Dec 2019, at 02:22, Rick Macklem  wrote:
> 
> Richard P Mackerras wrote:
> 
>> Hi,
>> What software version is the NetApp using?
>> Is the exported volume big?
>> Is the vserver configured for 64bit identifiers?
>> 
>> If you enable NFS V4.0 or 4.1 other NFS clients using defaults might mount 
>> NFSv4.x >unexpectedly after a reboot so you need to watch that.
> The FreeBSD client always uses NFSv3 mounts by default. To get NFSv4 you must
> explicitly specify the "nfsv4" or "vers=4" mount option. For NFSv4.1, you must
> also specify "minorversion=1".
> 
> The Linux distros I am familiar with will use the highest NFS version 
> supported by
> the server by default. (I suspect some are using NFSv4.1 without realizing it,
> which isn't necessarily bad.)
> 
> nfsstat -m
> will show you which version is actually in use for both FreeBSD and Linux.
> 
all mounts are nfsv3/tcp.
the error also appears on freebsd-11.2-stable; I’m now checking whether it’s
also happening on 12.1.
btw, the NetApp version is 9.3P17

cheers,
danny

> rick
> 
> Cheers
> 
> Richard
> (NetApp admin)
> 
> On Wed, 18 Dec 2019 at 15:46, Daniel Braniss wrote:
> 
> 
>> On 18 Dec 2019, at 16:55, Rick Macklem wrote:
>> 
>> Daniel Braniss wrote:
>> 
>>> Hi,
>>> The server with the problems is running FreeBSD 11.1 stable, it was working
>>> fine for several months,
>>> but after a software upgrade of our NetAPP server it’s reporting many lockd
>>> errors and becomes catatonic,
>>> ...
>>> Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not 
>>> responding
>>> Dec 18 13:11:45 moo-09 last message repeated 7 times
>>> Dec 18 13:12:55 moo-09 last message repeated 8 times
>>> Dec 18 13:13:10 moo-09 kernel: nfs server fr-06:/web/www: lockd is alive 
>>> again
>>> Dec 18 13:13:10 moo-09 last message repeated 8 times
>>> Dec 18 13:13:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen
>>> queue overflow: 194 already in queue awaiting acceptance (1 occurrences)
>>> Dec 18 13:14:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen
>>> queue overflow: 193 already in queue awaiting acceptance (3957 occurrences)
>>> Dec 18 13:15:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen
>>> queue overflow: 193 already in queue awaiting acceptance …
>> Seems like their software upgrade didn't improve handling of NLM RPCs?
>> Appears to be handling RPCs slowly and/or intermittently. Note that no one
>> tests it with IPv6, so at least make sure you are still using IPv4 for the 
>> mounts and
>> try and make sure IP broadcast works between client and Netapp. I think the 
>> NLM
>> and NSM (rpc.statd) still use IP broadcast sometimes.
>> 
> we are ipv4 - we have our own class c :-)
>> Maybe the network guys can suggest more w.r.t. why, but as I've stated 
>> before,
>> the NLM is a fundamentally broken protocol which was never published by Sun,
>> so I suggest you avoid using it if at all possible.
> well, at the moment the ball is on NetAPP court, and switching to NFSv4 at 
> the moment is out of the question, it’s
> a production server used by several thousand students.
> 
>> 
>> - If the locks don't need to be seen by other clients, you can just use the 
>> "nolockd"
>>  mount option.
>> or
>> - If locks need to be seen by other clients, try NFSv4 mounts. Netapp filers
>>  should support NFSv4.1, which is a much better protocol than NFSv4.0.
>> 
>> Good luck with it, rick
> thanks
>danny
> 
>> …
>> any ideas?
>> 
>> thanks,
>>   danny
>> 

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: nfs lockd errors after NetApp software upgrade.

2019-12-18 Thread Rick Macklem
Richard P Mackerras wrote:

>Hi,
>What software version is the NetApp using?
>Is the exported volume big?
>Is the vserver configured for 64bit identifiers?
>
>If you enable NFS V4.0 or 4.1 other NFS clients using defaults might mount 
>NFSv4.x >unexpectedly after a reboot so you need to watch that.
The FreeBSD client always uses NFSv3 mounts by default. To get NFSv4 you must
explicitly specify the "nfsv4" or "vers=4" mount option. For NFSv4.1, you must
also specify "minorversion=1".

The Linux distros I am familiar with will use the highest NFS version supported
by the server by default. (I suspect some are using NFSv4.1 without realizing it,
which isn't necessarily bad.)

nfsstat -m
will show you which version is actually in use for both FreeBSD and Linux.
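
To make the options above concrete, here is a minimal sketch in Python that
only assembles and prints FreeBSD mount command lines for the three cases
described; it does not run them. The helper names and the mount point
/mnt/www are made up for illustration; fr-06:/web/www is the export that
appears in the log in this thread, and the option names are the ones quoted
above.

```python
def nfs_mount_options(version="3"):
    """Return the version-related mount options (hypothetical helper)."""
    if version == "3":
        return []                           # NFSv3 is the FreeBSD client default
    if version == "4.0":
        return ["nfsv4"]                    # "vers=4" is the equivalent spelling
    if version == "4.1":
        return ["nfsv4", "minorversion=1"]  # 4.1 needs both options
    raise ValueError("unsupported NFS version: %r" % (version,))


def nfs_mount_command(export, mountpoint, version="3"):
    """Build (but do not execute) a mount command line as a string."""
    opts = nfs_mount_options(version)
    opt_part = ["-o", ",".join(opts)] if opts else []
    return " ".join(["mount", "-t", "nfs", *opt_part, export, mountpoint])


if __name__ == "__main__":
    # /mnt/www is an assumed mount point, not taken from the thread.
    for v in ("3", "4.0", "4.1"):
        print(nfs_mount_command("fr-06:/web/www", "/mnt/www", v))
```

After mounting, nfsstat -m (as noted above) is the way to confirm which
version is actually in use.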

rick

Cheers

Richard
(NetApp admin)

On Wed, 18 Dec 2019 at 15:46, Daniel Braniss wrote:


> On 18 Dec 2019, at 16:55, Rick Macklem wrote:
>
> Daniel Braniss wrote:
>
>> Hi,
>> The server with the problems is running FreeBSD 11.1 stable, it was working
>> fine for several months,
>> but after a software upgrade of our NetAPP server it’s reporting many lockd
>> errors and becomes catatonic,
>> ...
>> Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not 
>> responding
>> Dec 18 13:11:45 moo-09 last message repeated 7 times
>> Dec 18 13:12:55 moo-09 last message repeated 8 times
>> Dec 18 13:13:10 moo-09 kernel: nfs server fr-06:/web/www: lockd is alive 
>> again
>> Dec 18 13:13:10 moo-09 last message repeated 8 times
>> Dec 18 13:13:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen
>> queue overflow: 194 already in queue awaiting acceptance (1 occurrences)
>> Dec 18 13:14:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen
>> queue overflow: 193 already in queue awaiting acceptance (3957 occurrences)
>> Dec 18 13:15:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen
>> queue overflow: 193 already in queue awaiting acceptance …
> Seems like their software upgrade didn't improve handling of NLM RPCs?
> Appears to be handling RPCs slowly and/or intermittently. Note that no one
> tests it with IPv6, so at least make sure you are still using IPv4 for the 
> mounts and
> try and make sure IP broadcast works between client and Netapp. I think the 
> NLM
> and NSM (rpc.statd) still use IP broadcast sometimes.
>
we are ipv4 - we have our own class c :-)
> Maybe the network guys can suggest more w.r.t. why, but as I've stated before,
> the NLM is a fundamentally broken protocol which was never published by Sun,
> so I suggest you avoid using it if at all possible.
well, at the moment the ball is on NetAPP court, and switching to NFSv4 at the 
moment is out of the question, it’s
a production server used by several thousand students.

>
> - If the locks don't need to be seen by other clients, you can just use the 
> "nolockd"
>   mount option.
> or
> - If locks need to be seen by other clients, try NFSv4 mounts. Netapp filers
>   should support NFSv4.1, which is a much better protocol than NFSv4.0.
>
> Good luck with it, rick
thanks
danny

> …
> any ideas?
>
> thanks,
>danny
>

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to 
"freebsd-stable-unsubscr...@freebsd.org"


Re: nfs lockd errors after NetApp software upgrade.

2019-12-18 Thread Richard P Mackerras
Hi,
I’m sure the 64-bit identifiers aren’t an issue. Your export isn’t vast. I
assume you have restarted statd and lockd on FreeBSD.
I did search on the NetApp site earlier and nothing leapt out then. Sorry,
Richard


On Wed, 18 Dec 2019 at 16:06, Daniel Braniss  wrote:

>
>
> On 18 Dec 2019, at 17:58, Richard P Mackerras 
> wrote:
>
> Hi,
> What software version is the NetApp using?
>
> the very latest :-), but will try and find out later.
>
> Is the exported volume big?
>
> about 500G, but many files
> as far as I know, only accessed by one host running the web app - moodle.
>
> Is the vserver configured for 64bit identifiers
>
> what the issue here?
>
> ?
>
> If you enable NFS V4.0 or 4.1 other NFS clients using defaults might mount
> NFSv4.x unexpectedly after a reboot so you need to watch that.
>
> Cheers
>
> Richard
> (NetApp admin)
>
> On Wed, 18 Dec 2019 at 15:46, Daniel Braniss  wrote:
>
>>
>>
>> > On 18 Dec 2019, at 16:55, Rick Macklem  wrote:
>> >
>> > Daniel Braniss wrote:
>> >
>> >> Hi,
>> >> The server with the problems is running FreeBSD 11.1 stable, it was
>> >> working fine for several months,
>> >> but after a software upgrade of our NetAPP server it’s reporting many
>> >> lockd errors and becomes catatonic,
>> >> ...
>> >> Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not
>> responding
>> >> Dec 18 13:11:45 moo-09 last message repeated 7 times
>> >> Dec 18 13:12:55 moo-09 last message repeated 8 times
>> >> Dec 18 13:13:10 moo-09 kernel: nfs server fr-06:/web/www: lockd is
>> alive again
>> >> Dec 18 13:13:10 moo-09 last message repeated 8 times
>> >> Dec 18 13:13:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0:
>> >> Listen queue overflow: 194 already in queue awaiting acceptance (1
>> >> occurrences)
>> >> Dec 18 13:14:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0:
>> >> Listen queue overflow: 193 already in queue awaiting acceptance (3957
>> >> occurrences)
>> >> Dec 18 13:15:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0:
>> >> Listen queue overflow: 193 already in queue awaiting acceptance …
>> > Seems like their software upgrade didn't improve handling of NLM RPCs?
>> > Appears to be handling RPCs slowly and/or intermittently. Note that no
>> one
>> > tests it with IPv6, so at least make sure you are still using IPv4 for
>> the mounts and
>> > try and make sure IP broadcast works between client and Netapp. I think
>> the NLM
>> > and NSM (rpc.statd) still use IP broadcast sometimes.
>> >
>> we are ipv4 - we have our own class c :-)
>> > Maybe the network guys can suggest more w.r.t. why, but as I've stated
>> before,
>> > the NLM is a fundamentally broken protocol which was never published by
>> Sun,
>> > so I suggest you avoid using it if at all possible.
>> well, at the moment the ball is on NetAPP court, and switching to NFSv4
>> at the moment is out of the question, it’s
>> a production server used by several thousand students.
>>
>> >
>> > - If the locks don't need to be seen by other clients, you can just use
>> the "nolockd"
>> >   mount option.
>> > or
>> > - If locks need to be seen by other clients, try NFSv4 mounts. Netapp
>> filers
>> >   should support NFSv4.1, which is a much better protocol than NFSv4.0.
>> >
>> > Good luck with it, rick
>> thanks
>> danny
>>
>> > …
>> > any ideas?
>> >
>> > thanks,
>> >danny
>> >
>>
>
>
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: nfs lockd errors after NetApp software upgrade.

2019-12-18 Thread Daniel Braniss


> On 18 Dec 2019, at 17:58, Richard P Mackerras  wrote:
> 
> Hi,
> What software version is the NetApp using?
the very latest :-), but will try and find out later.

> Is the exported volume big?
about 500G, but many files
as far as I know, only accessed by one host running the web app - moodle.

> Is the vserver configured for 64bit identifiers
what's the issue here?

> ?
> 
> If you enable NFS V4.0 or 4.1 other NFS clients using defaults might mount 
> NFSv4.x unexpectedly after a reboot so you need to watch that. 
> 
> Cheers 
> 
> Richard 
> (NetApp admin)
> 
> On Wed, 18 Dec 2019 at 15:46, Daniel Braniss wrote:
> 
> 
> > On 18 Dec 2019, at 16:55, Rick Macklem wrote:
> > 
> > Daniel Braniss wrote:
> > 
> >> Hi,
> >> The server with the problems is running FreeBSD 11.1 stable, it was
> >> working fine for several months,
> >> but after a software upgrade of our NetAPP server it’s reporting many
> >> lockd errors and becomes catatonic,
> >> ...
> >> Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not 
> >> responding
> >> Dec 18 13:11:45 moo-09 last message repeated 7 times
> >> Dec 18 13:12:55 moo-09 last message repeated 8 times
> >> Dec 18 13:13:10 moo-09 kernel: nfs server fr-06:/web/www: lockd is alive 
> >> again
> >> Dec 18 13:13:10 moo-09 last message repeated 8 times
> >> Dec 18 13:13:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen
> >> queue overflow: 194 already in queue awaiting acceptance (1 occurrences)
> >> Dec 18 13:14:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen
> >> queue overflow: 193 already in queue awaiting acceptance (3957
> >> occurrences)
> >> Dec 18 13:15:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen
> >> queue overflow: 193 already in queue awaiting acceptance …
> > Seems like their software upgrade didn't improve handling of NLM RPCs?
> > Appears to be handling RPCs slowly and/or intermittently. Note that no one
> > tests it with IPv6, so at least make sure you are still using IPv4 for the 
> > mounts and
> > try and make sure IP broadcast works between client and Netapp. I think the 
> > NLM
> > and NSM (rpc.statd) still use IP broadcast sometimes.
> > 
> we are ipv4 - we have our own class c :-)
> > Maybe the network guys can suggest more w.r.t. why, but as I've stated 
> > before,
> > the NLM is a fundamentally broken protocol which was never published by Sun,
> > so I suggest you avoid using it if at all possible.
> well, at the moment the ball is on NetAPP court, and switching to NFSv4 at 
> the moment is out of the question, it’s
> a production server used by several thousand students.
> 
> > 
> > - If the locks don't need to be seen by other clients, you can just use the 
> > "nolockd"
> >   mount option.
> > or
> > - If locks need to be seen by other clients, try NFSv4 mounts. Netapp filers
> >   should support NFSv4.1, which is a much better protocol than NFSv4.0.
> > 
> > Good luck with it, rick
> thanks
> danny
> 
> > …
> > any ideas?
> > 
> > thanks,
> >danny
> > 

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: nfs lockd errors after NetApp software upgrade.

2019-12-18 Thread Richard P Mackerras
Hi,
What software version is the NetApp using?
Is the exported volume big?
Is the vserver configured for 64bit identifiers?

If you enable NFS V4.0 or 4.1 other NFS clients using defaults might mount
NFSv4.x unexpectedly after a reboot so you need to watch that.

Cheers

Richard
(NetApp admin)

On Wed, 18 Dec 2019 at 15:46, Daniel Braniss  wrote:

>
>
> > On 18 Dec 2019, at 16:55, Rick Macklem  wrote:
> >
> > Daniel Braniss wrote:
> >
> >> Hi,
> >> The server with the problems is running FreeBSD 11.1 stable, it was
> >> working fine for several months,
> >> but after a software upgrade of our NetAPP server it’s reporting many
> >> lockd errors and becomes catatonic,
> >> ...
> >> Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not
> responding
> >> Dec 18 13:11:45 moo-09 last message repeated 7 times
> >> Dec 18 13:12:55 moo-09 last message repeated 8 times
> >> Dec 18 13:13:10 moo-09 kernel: nfs server fr-06:/web/www: lockd is
> alive again
> >> Dec 18 13:13:10 moo-09 last message repeated 8 times
> >> Dec 18 13:13:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0:
> >> Listen queue overflow: 194 already in queue awaiting acceptance (1
> >> occurrences)
> >> Dec 18 13:14:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0:
> >> Listen queue overflow: 193 already in queue awaiting acceptance (3957
> >> occurrences)
> >> Dec 18 13:15:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0:
> >> Listen queue overflow: 193 already in queue awaiting acceptance …
> > Seems like their software upgrade didn't improve handling of NLM RPCs?
> > Appears to be handling RPCs slowly and/or intermittently. Note that no
> one
> > tests it with IPv6, so at least make sure you are still using IPv4 for
> the mounts and
> > try and make sure IP broadcast works between client and Netapp. I think
> the NLM
> > and NSM (rpc.statd) still use IP broadcast sometimes.
> >
> we are ipv4 - we have our own class c :-)
> > Maybe the network guys can suggest more w.r.t. why, but as I've stated
> before,
> > the NLM is a fundamentally broken protocol which was never published by
> Sun,
> > so I suggest you avoid using it if at all possible.
> well, at the moment the ball is on NetAPP court, and switching to NFSv4 at
> the moment is out of the question, it’s
> a production server used by several thousand students.
>
> >
> > - If the locks don't need to be seen by other clients, you can just use
> the "nolockd"
> >   mount option.
> > or
> > - If locks need to be seen by other clients, try NFSv4 mounts. Netapp
> filers
> >   should support NFSv4.1, which is a much better protocol than NFSv4.0.
> >
> > Good luck with it, rick
> thanks
> danny
>
> > …
> > any ideas?
> >
> > thanks,
> >danny
> >
>
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: nfs lockd errors after NetApp software upgrade.

2019-12-18 Thread Daniel Braniss


> On 18 Dec 2019, at 16:55, Rick Macklem  wrote:
> 
> Daniel Braniss wrote:
> 
>> Hi,
>> The server with the problems is running FreeBSD 11.1 stable, it was working
>> fine for several months,
>> but after a software upgrade of our NetAPP server it’s reporting many lockd
>> errors and becomes catatonic,
>> ...
>> Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not 
>> responding
>> Dec 18 13:11:45 moo-09 last message repeated 7 times
>> Dec 18 13:12:55 moo-09 last message repeated 8 times
>> Dec 18 13:13:10 moo-09 kernel: nfs server fr-06:/web/www: lockd is alive 
>> again
>> Dec 18 13:13:10 moo-09 last message repeated 8 times
>> Dec 18 13:13:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen
>> queue overflow: 194 already in queue awaiting acceptance (1 occurrences)
>> Dec 18 13:14:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen
>> queue overflow: 193 already in queue awaiting acceptance (3957 occurrences)
>> Dec 18 13:15:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen
>> queue overflow: 193 already in queue awaiting acceptance …
> Seems like their software upgrade didn't improve handling of NLM RPCs?
> Appears to be handling RPCs slowly and/or intermittently. Note that no one
> tests it with IPv6, so at least make sure you are still using IPv4 for the 
> mounts and
> try and make sure IP broadcast works between client and Netapp. I think the 
> NLM
> and NSM (rpc.statd) still use IP broadcast sometimes.
> 
we are ipv4 - we have our own class c :-)
> Maybe the network guys can suggest more w.r.t. why, but as I've stated before,
> the NLM is a fundamentally broken protocol which was never published by Sun,
> so I suggest you avoid using it if at all possible.
well, at the moment the ball is in NetApp's court, and switching to NFSv4 right
now is out of the question; it’s a production server used by several thousand
students.

> 
> - If the locks don't need to be seen by other clients, you can just use the 
> "nolockd"
>   mount option.
> or
> - If locks need to be seen by other clients, try NFSv4 mounts. Netapp filers
>   should support NFSv4.1, which is a much better protocol than NFSv4.0.
> 
> Good luck with it, rick
thanks
danny

> …
> any ideas?
> 
> thanks,
>danny
> 

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: nfs lockd errors after NetApp software upgrade.

2019-12-18 Thread Rick Macklem
Daniel Braniss wrote:

>Hi,
>The server with the problems is running FreeBSD 11.1 stable, it was working
>fine for several months,
>but after a software upgrade of our NetAPP server it’s reporting many lockd
>errors and becomes catatonic,
>...
>Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not responding
>Dec 18 13:11:45 moo-09 last message repeated 7 times
>Dec 18 13:12:55 moo-09 last message repeated 8 times
>Dec 18 13:13:10 moo-09 kernel: nfs server fr-06:/web/www: lockd is alive again
>Dec 18 13:13:10 moo-09 last message repeated 8 times
>Dec 18 13:13:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen queue
>overflow: 194 already in queue awaiting acceptance (1 occurrences)
>Dec 18 13:14:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen queue
>overflow: 193 already in queue awaiting acceptance (3957 occurrences)
>Dec 18 13:15:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen queue
>overflow: 193 already in queue awaiting acceptance …
Seems like their software upgrade didn't improve handling of NLM RPCs?
Appears to be handling RPCs slowly and/or intermittently. Note that no one
tests it with IPv6, so at least make sure you are still using IPv4 for the 
mounts and
try and make sure IP broadcast works between client and Netapp. I think the NLM
and NSM (rpc.statd) still use IP broadcast sometimes.

Maybe the network guys can suggest more w.r.t. why, but as I've stated before,
the NLM is a fundamentally broken protocol which was never published by Sun,
so I suggest you avoid using it if at all possible.

- If the locks don't need to be seen by other clients, you can just use the
   "nolockd" mount option.
or
- If locks need to be seen by other clients, try NFSv4 mounts. Netapp filers
   should support NFSv4.1, which is a much better protocol than NFSv4.0.
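
The two alternatives above amount to a single decision. A minimal sketch in
Python, assuming the only input is whether other clients must see the locks;
the function name is hypothetical, and the returned strings are just mount
options named in this thread ("nolockd", or "nfsv4" with "minorversion=1").

```python
def lock_mount_options(locks_shared_with_other_clients):
    """Pick mount options following the two alternatives listed above."""
    if locks_shared_with_other_clients:
        # Locks must be visible to other clients: avoid the NLM by using
        # NFSv4.1, where locking is part of the NFS protocol itself.
        return "nfsv4,minorversion=1"
    # Locks only matter on this client: keep them local so no NLM traffic
    # is sent to the server's lockd.
    return "nolockd"


if __name__ == "__main__":
    print(lock_mount_options(False))
    print(lock_mount_options(True))
```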

Good luck with it, rick
…
any ideas?

thanks,
danny

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


nfs lockd errors after NetApp software upgrade.

2019-12-18 Thread Daniel Braniss
Hi,
The server with the problems is running FreeBSD 11.1 stable. It was working
fine for several months, but after a software upgrade of our NetApp server
it’s reporting many lockd errors and becoming catatonic:
...
Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not responding
Dec 18 13:11:45 moo-09 last message repeated 7 times
Dec 18 13:12:55 moo-09 last message repeated 8 times
Dec 18 13:13:10 moo-09 kernel: nfs server fr-06:/web/www: lockd is alive again
Dec 18 13:13:10 moo-09 last message repeated 8 times
Dec 18 13:13:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen queue 
overflow: 194 already in queue awaiting acceptance (1 occurrences)
Dec 18 13:14:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen queue 
overflow: 193 already in queue awaiting acceptance (3957 occurrences)
Dec 18 13:15:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen queue 
overflow: 193 already in queue awaiting acceptance (3404 occurrences)
Dec 18 13:16:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen queue 
overflow: 196 already in queue awaiting acceptance (3553 occurrences)
Dec 18 13:17:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen queue 
overflow: 194 already in queue awaiting acceptance (3661 occurrences)
Dec 18 13:18:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen queue 
overflow: 197 already in queue awaiting acceptance (4030 occurrences)
Dec 18 13:19:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen queue 
overflow: 193 already in queue awaiting acceptance (2560 occurrences)
Dec 18 13:20:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen queue 
overflow: 193 already in queue awaiting acceptance (1495 occurrences)
Dec 18 13:21:32 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen queue 
overflow: 193 already in queue awaiting acceptance (817 occurrences)
Dec 18 14:54:43 moo-09 kernel: nfs server fr-06:/mdlbck: lockd not responding
Dec 18 14:55:19 moo-09 last message repeated 2 times
Dec 18 14:55:34 moo-09 kernel: nfs server fr-06:/mdlbck: lockd is alive again
…
any ideas?

thanks,
danny

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
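
For anyone wanting to quantify the sonewconn lines in the original report, a
minimal sketch in Python that totals the "(N occurrences)" counters and finds
the largest reported queue length. Two assumptions: the counter is read here
as the number of overflow events since the previous rate-limited message
(the thread does not confirm this), and the function and variable names are
invented for illustration. The sample is the first three sonewconn entries
from the log above, wrapped as they appear in the archive.

```python
import re

# Matches the listen-queue overflow message in the form shown in this thread.
SONEWCONN_RE = re.compile(
    r"(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ kernel: sonewconn: pcb \S+ "
    r"Listen queue overflow: (\d+) already in queue awaiting acceptance "
    r"\((\d+) occurrences\)"
)


def summarize_overflows(log_text):
    """Return (rows, total_occurrences, peak_queue_length)."""
    # Archived mail wraps long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    rows = [(ts, int(queued), int(count))
            for ts, queued, count in SONEWCONN_RE.findall(flat)]
    total = sum(count for _, _, count in rows)
    peak = max((queued for _, queued, _ in rows), default=0)
    return rows, total, peak


SAMPLE = """\
Dec 18 13:13:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen queue
overflow: 194 already in queue awaiting acceptance (1 occurrences)
Dec 18 13:14:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen queue
overflow: 193 already in queue awaiting acceptance (3957 occurrences)
Dec 18 13:15:29 moo-09 kernel: sonewconn: pcb 0xf8004cc051d0: Listen queue
overflow: 193 already in queue awaiting acceptance (3404 occurrences)
"""

if __name__ == "__main__":
    rows, total, peak = summarize_overflows(SAMPLE)
    for ts, queued, count in rows:
        print(ts, "queued:", queued, "occurrences:", count)
    print("total occurrences:", total, "peak queue length:", peak)
```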