NFS performance (Currently 2.6.20)

2008-02-06 Thread Jesper Krogh

Hi.

I'm currently trying to optimize our NFS server. We're running in a
cluster setup with a single NFS server and some compute nodes pulling data
from it. Currently the dataset is less than 10GB, so it fits in the memory
of the NFS server (confirmed via vmstat 1).
I'm getting around 500 Mbit/s (700 Mbit/s peak) off the server on a gigabit
link, and the server is CPU-bottlenecked when this happens. The clients show
iowait around 30-50%.

Is it reasonable to expect to be able to fill a gigabit link in this
scenario? (I'd like to put in a 10Gbit interface, but there is little point
while I have a CPU bottleneck.)

Should I go for NFSv2 (the default if I don't change mount options), NFSv3,
or NFSv4?

The NFSv3 default mount options give around 1MB for rsize and wsize, but the
nfs man page suggests setting them to around 32K.

I probably only need some pointers to the documentation.

Thanks.
-- 
Jesper Krogh



Re: [PATCH 2/2] NLM: Convert lockd to use kthreads

2008-02-06 Thread Jeff Layton
On Tue, 5 Feb 2008 23:35:48 -0500
Christoph Hellwig [EMAIL PROTECTED] wrote:

 On Tue, Feb 05, 2008 at 02:37:57PM -0500, Jeff Layton wrote:
  Because kthread_stop blocks until the kthread actually goes down,
  we have to send the signal before calling it. This means that there
  is a very small race window like this where lockd_down could block
  for a long time:
  
  lockd_down signals lockd
  lockd invalidates locks
  lockd flushes signals
  lockd checks kthread_should_stop
  lockd_down calls kthread_stop
  lockd calls svc_recv
  
  ...and lockd blocks until recvmsg returns. I think this is a
  pretty unlikely scenario though. We could probably ensure it
  doesn't happen with some locking but I'm not sure that it would be
  worth the trouble.
 
 This is not avoidable unless we take sending the signal into the
 kthread machinery.  
 

Yes. Perhaps we should consider a kthread_stop_with_signal() function
that does a kthread_stop and sends a signal before waiting for
completion? Most users of kthread_stop won't need it, but it would be
nice here. CIFS could also probably use something like that.
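For illustration, a minimal sketch of what such a helper might look like
(the name and the use of send_sig() are my assumptions; no such function
exists today, and a plain wrapper like this only narrows the race described
above rather than closing it, since that needs help from the kthread
machinery itself):

/* hypothetical helper: deliver a signal, then wait for the kthread to exit,
 * mirroring what lockd_down() would otherwise open-code */
static int kthread_stop_with_signal(struct task_struct *k, int signo)
{
	send_sig(signo, k, 1);	/* wake the thread out of svc_recv() */
	return kthread_stop(k);	/* set should_stop and wait for exit */
}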

 You should probably add a comment similar to your patch description
 above the place where the signal is sent.

I'll do that and respin...

In the interest of full disclosure, we have some other options besides
sending a signal here:

1) we could call svc_recv with a shorter timeout. This means that lockd
will wake up more frequently, even when it has nothing to do.

2) we could try to ensure that when lockd_down is called that a msg
(maybe a NULL procedure) is sent to lockd's socket to wake it up after
kthread_stop is called. This probably would mean queuing up a task to a
workqueue to do this.

...neither of these seem more palatable than sending a signal.

-- 
Jeff Layton [EMAIL PROTECTED]


Re: NFS performance (Currently 2.6.20)

2008-02-06 Thread Gabriel Barazer

Hi,

On 02/06/2008 11:04:34 AM +0100, Jesper Krogh [EMAIL PROTECTED] wrote:

Hi.

I'm currently trying to optimize our NFS server. We're running in a
cluster setup with a single NFS server and some compute nodes pulling data
from it. Currently the dataset is less than 10GB so it fits in memory of
the NFS-server. (confirmed via vmstat 1).
Currently I'm  getting around 500mbit (700 peak) of the server on a
gigabit link and the server is CPU-bottlenecked when this happens. Clients
having iowait around 30-50%.


I have a similar setup, and I'm very curious about how you can read an 
iowait value from the clients: on my nodes (server 2.6.21.5/clients 
2.6.23.14), the iowait counter is only incremented when dealing with 
block devices, and since my nodes are diskless my iowait is near 0%.


Maybe I'm wrong, but when the NFS server lags, it is my system counter 
that increases (with peaks at 30% system instead of 5-10%).



Is it reasonable to expect to be able to fill a gigabit link in this
scenario? (I'd like to put in a 10Gbit interface, but when I have a
cpu-bottleneck)


I'm sure this is possible, but it is very dependent on what kind of 
traffic you have. If you only have data to pull (which theoretically 
never invalidates the page cache on the server), and you use options 
like 'noatime,nodiratime' to keep NFS from updating the access times, it 
seems possible to me. But maybe your CPU is busy doing something other 
than just handling NFS traffic. Maybe you should change your network 
controller? I use the Intel gigabit ones (integrated ESB2 with the e1000 
driver) with rx-polling and Intel I/OAT enabled (DMA engine), and this 
really helps by reducing interrupts when dealing with a lot of traffic.


You will have to check your kernel config to see whether I/OAT is enabled 
in the DMA engines section.
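If it helps, I check for that with something like the following one-liner
(the config symbol names are from memory of that era's DMA engine Kconfig,
so verify them against your own tree):

grep -E 'CONFIG_DMA_ENGINE|CONFIG_INTEL_IOATDMA|CONFIG_NET_DMA' /boot/config-$(uname -r)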




Should I go for NFSv2 (default if I dont change mount options) NFSv3 ? or
NFSv4


NFSv2/3 have nearly the same performance, and NFSv4 takes a slight 
performance hit, probably because of its immaturity: it's too early to 
work on performance while the features are not completely stable.




NFSv3 default mount options is around 1MB for rsize and wsize, but reading
the nfs-man page, they suggest setting them up to around 32K.


The values for the rsize and wsize mount options depend on the amount of 
memory you have (on the server, AFAIK), and once you have 4GB the computed 
values are not very realistic anymore. On my systems the default 
rsize/wsize come out at 512KB and everything runs fine, but I'm sure some 
work is needed to tune the buffer sizes more precisely on machines with 
large amounts of memory (e.g. a 1MB buffer is nonsense). The 32k value is 
a very old one, and the man page doesn't even explain the memory-related 
rsize/wsize values.
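For what it's worth, pinning the sizes explicitly on the client is just a
mount option; something along these lines (the server name, export path and
the 32K figure are only illustrative, NFSv3 over TCP assumed):

mount -t nfs -o vers=3,proto=tcp,rsize=32768,wsize=32768,hard,intr server:/export /mnt/data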




I probably only need some pointers to the documentation.


And the documentation probably needs a refresh, but things are 
changing nearly every week here...


Gabriel


Re: (fwd) nfs hang on 2.6.24

2008-02-06 Thread Trond Myklebust

On Wed, 2008-02-06 at 19:24 +1300, Andrew Dixie wrote:
  The fact that the delegreturn call appears to have hit xprt_timer is
  interesting. Under normal circumstances, timeouts should never occur
  under NFSv4. Could you tell us what mount options you're using here?
 
  Also please could you confirm for us that the server is still up and
  responding to requests from other clients.
 
 The mount options were defaults:
 i.e. mount -t nfs4 server:/mnt /mnt
 sshd has died. I will confirm exactly what is in /proc/mounts when I get 
 physical access.
 
 The server is still up serving active nfsv3 clients. I mounted nfsv4 on 
 another client and that worked too.

Thanks. My other questions are:

  What is rpciod doing while the machine hangs?
  Does 'netstat -t' show an active tcp connection to the server?
  Does tcpdump show any traffic going on the wire?
  What server are you running against? From the error messages below, I
see it is a Linux machine, but which kernel is it running?

 The following appears in the server logs:
 Feb  4 08:28:01 devfile kernel: NFSD: setclientid: string in use by 
 client(clientid 47945499/1c88)
 Feb  4 08:34:18 devfile kernel: NFSD: setclientid: string in use by 
 client(clientid 47945499/1c8d)
 Feb  4 08:38:02 devfile kernel: NFSD: setclientid: string in use by 
 client(clientid 47945499/1c8f)
 Feb  4 10:01:02 devfile kernel: NFSD: setclientid: string in use by 
 client(clientid 47a627bd/0002)
 Feb  4 10:07:37 devfile kernel: NFSD: setclientid: string in use by 
 client(clientid 47a627bd/0005)
 Feb  4 10:17:02 devfile kernel: NFSD: setclientid: string in use by 
 client(clientid 47a627bd/019e)
 Feb  5 07:59:58 devfile kernel: NFSD: setclientid: string in use by 
 client(clientid 47a627bd/03f2)
 Feb  5 08:01:02 devfile kernel: NFSD: setclientid: string in use by 
 client(clientid 47a627bd/03f3)
 
 These are not close to the times that it hung.

Yep. The above is entirely expected, and is not actually a bug. I keep
asking Bruce to remove that warning...

 Prior to Feb 4 it occurs 10 to 50 times a day (from when the client was 
 running 2.6.18 kernel)
 There is only one nfsv4 client.

OK...

Thanks
  Trond


Re: (fwd) nfs hang on 2.6.24

2008-02-06 Thread Trond Myklebust

On Wed, 2008-02-06 at 10:07 -0500, J. Bruce Fields wrote:

 That went into 2.6.22:
 
   21315edd4877b593d5bf.. [PATCH] knfsd: nfsd4: demote clientid
   in use printk to a dprintk
 
 It may suggest a problem if this is happening a lot, though, right?

The client should always be able to generate a new unique clientid if
this happens.

Trond


Re: NFS performance (Currently 2.6.20)

2008-02-06 Thread Trond Myklebust

On Wed, 2008-02-06 at 15:37 +0100, Gabriel Barazer wrote:

  
  Should I go for NFSv2 (default if I dont change mount options) NFSv3 ? or
  NFSv4
 
 NFSv2/3 have nearly the same performance

Only if you shoot yourself in the foot by setting the 'async' flag
in /etc/exports. Don't do that...

Most people will want to use NFSv3 for performance reasons. Unlike NFSv2
with 'async', NFSv3 with the 'sync' export flag set actually does _safe_
server-side caching of writes.
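To make the distinction concrete, the flag in question lives in
/etc/exports; a hedged example (the path and hostname pattern are made up):

# safe: the server may cache writes but honours stable writes and COMMIT
/export/data   node*.cluster.example.com(rw,sync,no_subtree_check)
# unsafe: the server acknowledges writes before they reach stable storage
/export/data   node*.cluster.example.com(rw,async,no_subtree_check)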

Trond



Re: NFS performance (Currently 2.6.20)

2008-02-06 Thread Jesper Krogh
 Hi,
 I'm currently trying to optimize our NFS server. We're running in a
 cluster setup with a single NFS server and some compute nodes pulling
 data from it. Currently the dataset is less than 10GB so it fits in
 memory of the NFS-server. (confirmed via vmstat 1). Currently I'm
 getting around 500mbit (700 peak) of the server on a gigabit link and
 the server is CPU-bottlenecked when this happens. Clients having iowait
 around 30-50%.

 I have a similar setup, and I'm very curious on how you can read an
 iowait value from the clients: On my nodes (server 2.6.21.5/clients
 2.6.23.14), the iowait counter is only incremented when dealing with
 block devices, and since my nodes are diskless my iowait is near 0%.

Output in top is like this:
top - 16:51:01 up 119 days,  6:10,  1 user,  load average: 2.09, 2.00, 1.41
Tasks:  74 total,   2 running,  72 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.2%us,  0.0%sy,  0.0%ni, 50.0%id, 49.8%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   2060188k total,  2047488k used,    12700k free,     2988k buffers
Swap:  4200988k total,    42776k used,  4158212k free,  1985500k cached

 Is it reasonable to expect to be able to fill a gigabit link in this
 scenario? (I'd like to put in a 10Gbit interface, but when I have a
 cpu-bottleneck)

 I'm sure this is possible, but it is very dependant on which kind of
 traffic you have. If you have only data to pull (which theoretically never
 invalidate the page cache on the server), and you have options like
 'noatime,nodiratime' to avoid nfs updating the access times, it
 seems possible to me. But maybe your CPU is busy doing something else than
 only computing NFS traffic. Maybe you should change your network
 controller ? I use the Intel Gigabit ones (integrated ESB2 with e1000
 driver) with rx-polling and Intel I/OAT enabled (DMA engine), and this
 really helps by reducing interrupts when dealing with a lot of traffic.

It is a Sun V20Z (dual Opteron) NIC is:
02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704
Gigabit Ethernet (rev 03)

Jesper
-- 
Jesper Krogh



[PATCH 1/4] NLM: set RPC_CLNT_CREATE_NOPING for NLM RPC clients

2008-02-06 Thread Jeff Layton
It's currently possible for an unresponsive NLM client to completely
lock up a server's lockd. The scenario is something like this:

1) client1 (or a process on the server) takes a lock on a file
2) client2 tries to take a blocking lock on the same file and
   awaits the callback
3) client2 goes unresponsive (plug pulled, network partition, etc)
4) client1 releases the lock

...at that point the server's lockd will try to queue up a GRANT_MSG
callback for client2, but first it requeues the block with a timeout of
30s. nlm_async_call will attempt to bind the RPC client to client2 and
will call rpc_ping. rpc_ping entails a sync RPC call and if client2 is
unresponsive it will take around 60s for that to time out. Once it times
out, it's already time to retry the block and the whole process repeats.

Once in this situation, nlmsvc_retry_blocked will never return until
the host starts responding again. lockd won't service new calls.

Fix this by skipping the RPC ping on NLM RPC clients. This makes
nlm_async_call return quickly when called.

Signed-off-by: Jeff Layton [EMAIL PROTECTED]
---
 fs/lockd/host.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/lockd/host.c b/fs/lockd/host.c
index ca6b16f..00063ee 100644
--- a/fs/lockd/host.c
+++ b/fs/lockd/host.c
@@ -244,6 +244,7 @@ nlm_bind_host(struct nlm_host *host)
 		.version	= host->h_version,
 		.authflavor	= RPC_AUTH_UNIX,
 		.flags		= (RPC_CLNT_CREATE_HARDRTRY |
+				   RPC_CLNT_CREATE_NOPING |
 				   RPC_CLNT_CREATE_AUTOBIND),
};
 
-- 
1.5.3.8



[PATCH 0/4] NLM: fix lockd hang when client blocking on released lock isn't responding

2008-02-06 Thread Jeff Layton
This patchset fixes the problem that Bruce pointed out last week when
we were discussing the lockd-kthread patches.

The main problem is described in patch #1 and that patch also fixes the
DoS. The remaining patches clean up how GRANT_MSG callbacks handle an
unresponsive client. The goal in those is to make sure that we don't
end up with a ton of duplicate RPC's in queue and that we try to handle
an invalidated block correctly.

Bruce, I'd like to see this fixed in 2.6.25 if at all possible.

Comments and suggestions are appreciated.

Signed-off-by: Jeff Layton [EMAIL PROTECTED]



[PATCH 3/4] NLM: don't reattempt GRANT_MSG when there is already an RPC in flight

2008-02-06 Thread Jeff Layton
With the current scheme in nlmsvc_grant_blocked, we can end up with more
than one GRANT_MSG callback for a block in flight. Right now, we requeue
the block unconditionally so that a GRANT_MSG callback is done again in
30s. If the client is unresponsive, it can take more than 30s for the
call already in flight to time out.

There's no benefit to having more than one GRANT_MSG RPC queued up at a
time, so put it on the list with a timeout of NLM_NEVER before doing the
RPC call. If the RPC call submission fails, we requeue it with a short
timeout. If it works, then nlmsvc_grant_callback will end up requeueing
it with a shorter timeout after it completes.

Signed-off-by: Jeff Layton [EMAIL PROTECTED]
---
 fs/lockd/svclock.c |   17 +
 1 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
index 2f4d8fa..82db7b3 100644
--- a/fs/lockd/svclock.c
+++ b/fs/lockd/svclock.c
@@ -763,11 +763,20 @@ callback:
 	dprintk("lockd: GRANTing blocked lock.\n");
 	block->b_granted = 1;
 
-	/* Schedule next grant callback in 30 seconds */
-	nlmsvc_insert_block(block, 30 * HZ);
+	/* keep block on the list, but don't reattempt until the RPC
+	 * completes or the submission fails
+	 */
+	nlmsvc_insert_block(block, NLM_NEVER);
 
-	/* Call the client */
-	nlm_async_call(block->b_call, NLMPROC_GRANTED_MSG, &nlmsvc_grant_ops);
+	/* Call the client -- use a soft RPC task since nlmsvc_retry_blocked
+	 * will queue up a new one if this one times out
+	 */
+	error = nlm_async_call(block->b_call, NLMPROC_GRANTED_MSG,
+				&nlmsvc_grant_ops);
+
+	/* RPC submission failed, wait a bit and retry */
+	if (error < 0)
+		nlmsvc_insert_block(block, 10 * HZ);
 }
 
 /*
-- 
1.5.3.8



[PATCH 4/4] NLM: don't requeue block if it was invalidated while GRANT_MSG was in flight

2008-02-06 Thread Jeff Layton
It's possible for lockd to catch a SIGKILL while a GRANT_MSG callback
is in flight. If this happens we don't want lockd to insert the block
back into the nlm_blocked list.

This helps that situation, but there's still a possible race. Fixing
that will mean adding real locking for nlm_blocked.

Signed-off-by: Jeff Layton [EMAIL PROTECTED]
---
 fs/lockd/svclock.c |   11 +++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
index 82db7b3..fe9bdb4 100644
--- a/fs/lockd/svclock.c
+++ b/fs/lockd/svclock.c
@@ -795,6 +795,17 @@ static void nlmsvc_grant_callback(struct rpc_task *task, void *data)
 
 	dprintk("lockd: GRANT_MSG RPC callback\n");
 
+	/* if the block is not on a list at this point then it has
+	 * been invalidated. Don't try to requeue it.
+	 *
+	 * FIXME: it's possible that the block is removed from the list
+	 * after this check but before the nlmsvc_insert_block. In that
+	 * case it will be added back. Perhaps we need better locking
+	 * for nlm_blocked?
+	 */
+	if (list_empty(&block->b_list))
+		return;
+
 	/* Technically, we should down the file semaphore here. Since we
 	 * move the block towards the head of the queue only, no harm
 	 * can be done, though. */
-- 
1.5.3.8



Re: (fwd) nfs hang on 2.6.24

2008-02-06 Thread J. Bruce Fields
On Wed, Feb 06, 2008 at 10:15:23AM -0500, Trond Myklebust wrote:
 
 On Wed, 2008-02-06 at 10:07 -0500, J. Bruce Fields wrote:
 
  That went into 2.6.22:
  
  21315edd4877b593d5bf.. [PATCH] knfsd: nfsd4: demote clientid
  in use printk to a dprintk
  
  It may suggest a problem if this is happening a lot, though, right?
 
 The client should always be able to generate a new unique clientid if
 this happens.

And then the client may fail to reclaim its state on the next server
reboot, or mistakenly prevent some other client from reclaiming state,
since it's not recording the new clientid in stable storage.  So if it's
happening a lot then I suppose we should figure out better ways to
generate client ids.

--b.


Re: (fwd) nfs hang on 2.6.24

2008-02-06 Thread Trond Myklebust

On Wed, 2008-02-06 at 12:23 -0500, J. Bruce Fields wrote:
 On Wed, Feb 06, 2008 at 10:15:23AM -0500, Trond Myklebust wrote:
  
  On Wed, 2008-02-06 at 10:07 -0500, J. Bruce Fields wrote:
  
   That went into 2.6.22:
   
 21315edd4877b593d5bf.. [PATCH] knfsd: nfsd4: demote clientid
 in use printk to a dprintk
   
   It may suggest a problem if this is happening a lot, though, right?
  
  The client should always be able to generate a new unique clientid if
  this happens.
 
 And then the client may fail to reclaim its state on the next server
 reboot, or mistakenly prevent some other client from reclaiming state,
 since it's not recording the new clientid in stable storage.  So if it's
 happening a lot then we I suppose we should figure out better ways to
 generate client id's.

Huh?

If the server reboots, the client will try to reclaim state using the
_same_ client identifier string.

Two clients should _not_ be able to generate the same clientid unless
they're also sharing the same IP address and a number of other
properties that we include in the client identifier.



[PATCH 0/2] convert lockd to kthread API (try #10)

2008-02-06 Thread Jeff Layton
This is the tenth iteration of the patchset to convert lockd to use the
kthread API. This patchset is smaller than the earlier ones since some
of the patches in those sets have already been taken into Bruce's tree.
This set only changes lockd to use the kthread API.

The only real difference between this patchset and the one posted
yesterday is some added comments to clarify the possible race involved
when signaling and calling kthread_stop.

Bruce, would you be willing to take this into your git tree once 2.6.25
development settles down? I'd like to have this considered for 2.6.26.

Thanks,

Signed-off-by: Jeff Layton [EMAIL PROTECTED]



[PATCH 1/2] SUNRPC: export svc_sock_update_bufs

2008-02-06 Thread Jeff Layton
Needed since the plan is to not have a svc_create_thread helper and to
have current users of that function just call kthread_run directly.

Signed-off-by: Jeff Layton [EMAIL PROTECTED]
Reviewed-by: NeilBrown [EMAIL PROTECTED]
Signed-off-by: J. Bruce Fields [EMAIL PROTECTED]
---
 net/sunrpc/svcsock.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 1d3e5fc..b73a92a 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1101,6 +1101,7 @@ void svc_sock_update_bufs(struct svc_serv *serv)
 	}
 	spin_unlock_bh(&serv->sv_lock);
 }
+EXPORT_SYMBOL(svc_sock_update_bufs);
 
 /*
  * Initialize socket for RPC use and create svc_sock struct
-- 
1.5.3.8



[PATCH 2/2] NLM: Convert lockd to use kthreads

2008-02-06 Thread Jeff Layton
Have lockd_up start lockd using kthread_run. With this change,
lockd_down now blocks until lockd actually exits, so there's no longer
need for the waitqueue code at the end of lockd_down. This also means
that only one lockd can be running at a time which simplifies the code
within lockd's main loop.

This also adds a check for kthread_should_stop in the main loop of
nlmsvc_retry_blocked and after that function returns. There's no sense
continuing to retry blocks if lockd is coming down anyway.

The main difference between this patch and earlier ones is that it
changes lockd_down to again send SIGKILL to lockd when it's coming
down. svc_recv() uses schedule_timeout, so we can end up blocking there
for a long time if we end up calling into it after kthread_stop wakes
up lockd. Sending a SIGKILL should help ensure that svc_recv returns
quickly if this occurs.

Because kthread_stop blocks until the kthread actually goes down,
we have to send the signal before calling it. This means that there
is a very small race window like this where lockd_down could block
for a long time:

lockd_down signals lockd
lockd invalidates locks
lockd flushes signals
lockd checks kthread_should_stop
lockd_down calls kthread_stop
lockd calls svc_recv

...and lockd blocks until svc_recv returns. I think this is a
pretty unlikely scenario though. This doesn't appear to be fixable
without changing the kthread_stop machinery to send a signal.

Signed-off-by: Jeff Layton [EMAIL PROTECTED]
---
 fs/lockd/svc.c |  144 +---
 fs/lockd/svclock.c |3 +-
 2 files changed, 72 insertions(+), 75 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index 0822646..35e5ae2 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -25,6 +25,7 @@
 #include <linux/smp.h>
 #include <linux/smp_lock.h>
 #include <linux/mutex.h>
+#include <linux/kthread.h>
 #include <linux/freezer.h>
 
 #include <linux/sunrpc/types.h>
@@ -48,14 +49,11 @@ EXPORT_SYMBOL(nlmsvc_ops);
 
 static DEFINE_MUTEX(nlmsvc_mutex);
 static unsigned int		nlmsvc_users;
-static pid_t			nlmsvc_pid;
+static struct task_struct	*nlmsvc_task;
 static struct svc_serv		*nlmsvc_serv;
 int				nlmsvc_grace_period;
 unsigned long			nlmsvc_timeout;
 
-static DECLARE_COMPLETION(lockd_start_done);
-static DECLARE_WAIT_QUEUE_HEAD(lockd_exit);
-
 /*
  * These can be set at insmod time (useful for NFS as root filesystem),
  * and also changed through the sysctl interface.  -- Jamie Lokier, Aug 2003
@@ -111,35 +109,30 @@ static inline void clear_grace_period(void)
 /*
  * This is the lockd kernel thread
  */
-static void
-lockd(struct svc_rqst *rqstp)
+static int
+lockd(void *vrqstp)
 {
int err = 0;
+   struct svc_rqst *rqstp = vrqstp;
unsigned long grace_period_expire;
 
-   /* Lock module and set up kernel thread */
-   /* lockd_up is waiting for us to startup, so will
-* be holding a reference to this module, so it
-* is safe to just claim another reference
-*/
-   __module_get(THIS_MODULE);
-   lock_kernel();
-
-   /*
-* Let our maker know we're running.
-*/
-	nlmsvc_pid = current->pid;
-	nlmsvc_serv = rqstp->rq_server;
-	complete(&lockd_start_done);
-
-	daemonize("lockd");
+   /* try_to_freeze() is called from svc_recv() */
set_freezable();
 
-   /* Process request with signals blocked, but allow SIGKILL.  */
+   /* Allow SIGKILL to tell lockd to drop all of its locks */
allow_signal(SIGKILL);
 
 	dprintk("NFS locking service started (ver " LOCKD_VERSION ").\n");
 
+   /*
+* FIXME: it would be nice if lockd didn't spend its entire life
+* running under the BKL. At the very least, it would be good to
+* have someone clarify what it's intended to protect here. I've
+* seen some handwavy posts about posix locking needing to be
+* done under the BKL, but it's far from clear.
+*/
+   lock_kernel();
+
if (!nlm_timeout)
nlm_timeout = LOCKD_DFLT_TIMEO;
nlmsvc_timeout = nlm_timeout * HZ;
@@ -148,10 +141,9 @@ lockd(struct svc_rqst *rqstp)
 
/*
 * The main request loop. We don't terminate until the last
-* NFS mount or NFS daemon has gone away, and we've been sent a
-* signal, or else another process has taken over our job.
+* NFS mount or NFS daemon has gone away.
 */
-	while ((nlmsvc_users || !signalled()) && nlmsvc_pid == current->pid) {
+   while (!kthread_should_stop()) {
long timeout = MAX_SCHEDULE_TIMEOUT;
char buf[RPC_MAX_ADDRBUFLEN];
 
@@ -161,6 +153,7 @@ lockd(struct svc_rqst *rqstp)
nlmsvc_invalidate_all();

Re: (fwd) nfs hang on 2.6.24

2008-02-06 Thread J. Bruce Fields
On Wed, Feb 06, 2008 at 12:52:17PM -0500, Trond Myklebust wrote:
 
 On Wed, 2008-02-06 at 12:23 -0500, J. Bruce Fields wrote:
  On Wed, Feb 06, 2008 at 10:15:23AM -0500, Trond Myklebust wrote:
   
   On Wed, 2008-02-06 at 10:07 -0500, J. Bruce Fields wrote:
   
That went into 2.6.22:

21315edd4877b593d5bf.. [PATCH] knfsd: nfsd4: demote clientid
in use printk to a dprintk

It may suggest a problem if this is happening a lot, though, right?
   
   The client should always be able to generate a new unique clientid if
   this happens.
  
  And then the client may fail to reclaim its state on the next server
  reboot, or mistakenly prevent some other client from reclaiming state,
  since it's not recording the new clientid in stable storage.  So if it's
  happening a lot then we I suppose we should figure out better ways to
  generate client id's.
 
 Huh?
 
 If the server reboots, the client will try to reclaim state using the
 _same_ client identifier string.

Oh, right, I was confusing client and server reboot and assuming the
client would forget the uniquifier on server reboot.  That's obviously
wrong!  The client will forget its own uniquifier on client reboot, but
that's alright since it's happy enough just to let that old state time
out at that point.  So the only possible problem is suboptimal behavior
when the client reboot time is less than the lease time.

 Two clients should _not_ be able to generate the same clientid unless
 they're also sharing the same IP address and a number of other
 properties that we include in the client identifier.

Or unless two client implementations just happen to have clashing
clientid generation algorithms, but we hope that's unlikely.

(Except that older Linux clients were prone to produce the same
clientid, if I remember right.  But the more likely explanation may be
that these are the result of a single client destroying and then
creating state on the server within a lease period, and the server being
stubborn and refusing to let go of the old state (even though no opens
are associated with it any more) until the end of a lease period.  I
think that's a server bug.)

--b.


Re: [PATCH 2/2] NLM: Convert lockd to use kthreads

2008-02-06 Thread Trond Myklebust

On Wed, 2008-02-06 at 13:21 -0500, Jeff Layton wrote:
 Have lockd_up start lockd using kthread_run. With this change,
 lockd_down now blocks until lockd actually exits, so there's no longer
 need for the waitqueue code at the end of lockd_down. This also means
 that only one lockd can be running at a time which simplifies the code
 within lockd's main loop.
 
 This also adds a check for kthread_should_stop in the main loop of
 nlmsvc_retry_blocked and after that function returns. There's no sense
 continuing to retry blocks if lockd is coming down anyway.
 
 The main difference between this patch and earlier ones is that it
 changes lockd_down to again send SIGKILL to lockd when it's coming
 down. svc_recv() uses schedule_timeout, so we can end up blocking there
 for a long time if we end up calling into it after kthread_stop wakes
 up lockd. Sending a SIGKILL should help ensure that svc_recv returns
 quickly if this occurs.
 
 Because kthread_stop blocks until the kthread actually goes down,
 we have to send the signal before calling it. This means that there
 is a very small race window like this where lockd_down could block
 for a long time:

Having looked again at the code, could you please remind me _why_ we
need to signal the process?

AFAICS, kthread_stop() should normally wake the process up if it is in
the schedule_timeout() state in svc_recv() since it uses
wake_up_process(). Shouldn't the only difference be that svc_recv() will
return -EAGAIN instead of -EINTR?

If so, why can't we just forgo the signal?

Trond



Re: NFS performance (Currently 2.6.20)

2008-02-06 Thread Trond Myklebust

On Wed, 2008-02-06 at 19:24 +0100, Gabriel Barazer wrote:
 Oops (tm)! Fortunately I do mostly reads, but maybe the exports(5) man 
 page should be updated. According to the man page, I thought that 
 although writes aren't committed to the block devices, the server-side 
 cache is correctly synchronized (but lost if you pull the plug).

...or if the server crashes for some reason.

 Thanks 
 for the explanation. Having a battery-backed large write cache on the 
 server, is there a performance hit when switching from async to sync in 
 NFSv3?

The main performance hits occur on operations like create(), mkdir(),
rename and unlink() since they are required to be immediately synced to
disk.
IOW: there will be a noticeable overhead when writing lots of small
files.

For large files, the overhead should be minimal, since all writes can be
cached by the server until the close() operation.

Trond



Re: [PATCH 2/2] NLM: Convert lockd to use kthreads

2008-02-06 Thread Jeff Layton
On Wed, 06 Feb 2008 13:36:31 -0500
Trond Myklebust [EMAIL PROTECTED] wrote:

 
 On Wed, 2008-02-06 at 13:21 -0500, Jeff Layton wrote:
  Have lockd_up start lockd using kthread_run. With this change,
  lockd_down now blocks until lockd actually exits, so there's no
  longer need for the waitqueue code at the end of lockd_down. This
  also means that only one lockd can be running at a time which
  simplifies the code within lockd's main loop.
  
  This also adds a check for kthread_should_stop in the main loop of
  nlmsvc_retry_blocked and after that function returns. There's no
  sense continuing to retry blocks if lockd is coming down anyway.
  
  The main difference between this patch and earlier ones is that it
  changes lockd_down to again send SIGKILL to lockd when it's coming
  down. svc_recv() uses schedule_timeout, so we can end up blocking
  there for a long time if we end up calling into it after
  kthread_stop wakes up lockd. Sending a SIGKILL should help ensure
  that svc_recv returns quickly if this occurs.
  
  Because kthread_stop blocks until the kthread actually goes down,
  we have to send the signal before calling it. This means that there
  is a very small race window like this where lockd_down could block
  for a long time:
 
 Having looked again at the code, could you please remind me _why_ we
 need to signal the process?
 
 AFAICS, kthread_stop() should normally wake the process up if it is in
 the schedule_timeout() state in svc_recv() since it uses
 wake_up_process(). Shouldn't the only difference be that svc_recv()
 will return -EAGAIN instead of -EINTR?
 
 If so, why can't we just forgo the signal?
 

There's no guarantee that kthread_stop() won't wake up lockd before
schedule_timeout() gets called, but after the last check for
kthread_should_stop().

-- 
Jeff Layton [EMAIL PROTECTED]


Re: [PATCH 2/2] NLM: Convert lockd to use kthreads

2008-02-06 Thread Trond Myklebust

On Wed, 2008-02-06 at 13:47 -0500, Jeff Layton wrote:
 There's no guarantee that kthread_stop() won't wake up lockd before
 schedule_timeout() gets called, but after the last check for
 kthread_should_stop().

Doesn't the BKL pretty much eliminate this race? (assuming you transform
that call to 'if (!kthread_should_stop()) schedule_timeout();')
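For reference, the usual lost-wakeup-free form of that transformation sets
the task state before testing the flag; roughly (a sketch, not the actual
svc_recv() code):

	/* a wake_up_process() from kthread_stop() after this point just
	 * leaves us TASK_RUNNING, so schedule_timeout() returns at once */
	set_current_state(TASK_INTERRUPTIBLE);
	if (!kthread_should_stop())
		schedule_timeout(timeout);
	__set_current_state(TASK_RUNNING);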

Trond



Re: [PATCH 2/2] NLM: Convert lockd to use kthreads

2008-02-06 Thread Jeff Layton
On Wed, 06 Feb 2008 13:52:34 -0500
Trond Myklebust [EMAIL PROTECTED] wrote:

 
 On Wed, 2008-02-06 at 13:47 -0500, Jeff Layton wrote:
  There's no guarantee that kthread_stop() won't wake up lockd before
  schedule_timeout() gets called, but after the last check for
  kthread_should_stop().
 
 Doesn't the BKL pretty much eliminate this race? (assuming you
 transform that call to 'if (!kthread_should_stop())
 schedule_timeout();')
 
 Trond
 

I don't think so. That would require that lockd_down is always called
with the BKL held, and I don't think it is, is it?
-- 
Jeff Layton [EMAIL PROTECTED]


Re: NFS EINVAL on open(... | O_TRUNC) on 2.6.23.9

2008-02-06 Thread Chuck Lever

Hi Gianluca-

On Feb 6, 2008, at 1:25 PM, Gianluca Alberici wrote:

Hello all,

Thanks to Chuck's help I finally decided to proceed to a git bisect  
and found the bad patch. Does anybody have an idea why it  
breaks userspace NFS servers as we have seen? Sorry for emailing  
Chuck Lever and Andrew Morton directly, but I really wanted to thank  
Chuck for his precious help, and I thought that with akpm having signed  
this commit maybe he's going to figure out what's wrong easily.


The commit you found is a plausible source of the trouble (based on  
our current theory about the problem).


What isn't quite clear to me is whether this commit causes your  
user-space server to start failing suddenly, or it causes the client to  
start sending the special non-standard time stamps in the SETATTR  
request.  My guess is the latter, but I want to confirm this guess  
against reality  :-)


Are you running the client and server concurrently on the same  
system?  If so, it would be helpful if you could run this test with a  
constant kernel version on one side while varying it on the other.   
If client and server are already on different systems, can you tell  
us which system and which kernel combinations caused the failure?


A matrix of combinations might be:

1. server kernel is before 1c710c89, client kernel is before 1c710c89
2. server kernel is before 1c710c89, client kernel is after 1c710c89
3. server kernel is after 1c710c89, client kernel is before 1c710c89
4. server kernel is after 1c710c89, client kernel is after 1c710c89

Thanks.


This is what i finally get from git:

1c710c896eb461895d3c399e15bb5f20b39c9073 is first bad commit
commit 1c710c896eb461895d3c399e15bb5f20b39c9073
Author: Ulrich Drepper [EMAIL PROTECTED]
Date:   Tue May 8 00:33:25 2007 -0700

   utimensat implementation

   Implement utimensat(2) which is an extension to futimesat(2) in  
that it


   a) supports nano-second resolution for the timestamps
   b) allows to selectively ignore the atime/mtime value
   c) allows to selectively use the current time for either atime  
or mtime
   d) supports changing the atime/mtime of a symlink itself along  
the lines

  of the BSD lutimes(3) functions

[...]

   [EMAIL PROTECTED]: add missing i386 syscall table entry]
   Signed-off-by: Ulrich Drepper [EMAIL PROTECTED]
   Cc: Alexey Dobriyan [EMAIL PROTECTED]
   Cc: Michael Kerrisk [EMAIL PROTECTED]
   Cc: [EMAIL PROTECTED]
   Signed-off-by: Andrew Morton [EMAIL PROTECTED]
   Signed-off-by: Linus Torvalds [EMAIL PROTECTED]

:04 04 3bedbc7fd919ba167b8e5f208a630261570853bb  
927002a9423dcb51ba4f7bee53e60cdca6c1df43 M  arch
:04 04 fd688c5b534efd3111cbf1e1095d6ff631738325  
3d0fbf20fb3da1cb380c92f5b2b39815897376d3 M  fs
:04 04 bfb1a907a9a842db4fa3543e12a8381d4e11b1eb  
9c1d99324db12e066c0d17870fe48457809ad43b M  include


Thanks in advance, regards,

Gianluca


Hi Gianluca-

On Jan 30, 2008, at 7:40 AM, Gianluca Alberici wrote:


Hello again everybody

Here follows the testbench:

- I got two mirrors, same machine, same disk etc... changed  
hostname and IP, and on the second I have recompiled the kernel.

- First: 2.6.21.7 on Debian sarge
- Second: 2.6.22, same system.
- On both I have the latest nfs-user-server and cfsd versions
- The export file is the same (localhost /opt/nfs (rw, async);  
stripping off the async option does not change anything)

- Mount options are exactly the same.

The problem arises in the very same manner with both nfs and cfsd:

NFS:setattr {
...
...
RPC:call_decode {
return 22;
}
...
return 22;
}



Again, there is nothing wrong with the RPC client or call_decode.  
The *server* is returning NFSERR_INVAL (22) to a SETATTR request;  
the RPC client is simply passing that along to the NFS client, as  
it is designed to do.



I have tried these kernels:

2.6.16.11 works
2.6.20 works
2.6.21 works
2.6.21.7 works
2.6.22 doesn't work (contiguous to the previous version)
2.6.23 doesn't work (same behavior as previous)
2.6.23.9 doesn't work (as above)
2.6.24-rc7 doesn't work (as above)

I would really like to do more, client or server side, if you have  
any suggestions.
Can we find out which change (it doesn't matter whether it is a bug  
or a bug fix) caused this problem?



The goal here is to identify the kernel change between 2.6.21 and  
2.6.22 that makes the client generate SETATTR requests the user-space  
server chokes on. It may be a change in the NFS client, or  
it could be somewhere else in the file system stack, like the VFS.


The usual procedure is to use git bisect. It does a binary  
search on the kernel patches between the working kernel version  
and the kernel version that is known not to work. It works like this:


1. You clone a linux kernel git repository (if you don't have a git
repository already)

2. You tell git bisect which kernel version is working, and which  
isn't.
git bisect then selects a commit about half way in between the  
working

and non-working versions, and checks 
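For reference, the command sequence for such a bisect is roughly as follows
(the repository URL is the usual kernel.org one; how you build, boot and run
the SETATTR test at each step is up to you):

  git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
  cd linux-2.6
  git bisect start
  git bisect bad v2.6.22     # first release known to fail
  git bisect good v2.6.21    # last mainline release known to work
  # build and boot the checked-out kernel, run the test, then mark it:
  git bisect good            # ...or 'git bisect bad', as appropriate
  # repeat until git reports the first bad commit, then clean up with:
  git bisect reset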

Re: [PATCH 2/2] NLM: Convert lockd to use kthreads

2008-02-06 Thread Jeff Layton
On Wed, 6 Feb 2008 13:47:02 -0500
Jeff Layton [EMAIL PROTECTED] wrote:

 On Wed, 06 Feb 2008 13:36:31 -0500
 Trond Myklebust [EMAIL PROTECTED] wrote:
 
  
  On Wed, 2008-02-06 at 13:21 -0500, Jeff Layton wrote:
   Have lockd_up start lockd using kthread_run. With this change,
   lockd_down now blocks until lockd actually exits, so there's no
   longer need for the waitqueue code at the end of lockd_down. This
   also means that only one lockd can be running at a time which
   simplifies the code within lockd's main loop.
   
   This also adds a check for kthread_should_stop in the main loop of
   nlmsvc_retry_blocked and after that function returns. There's no
   sense continuing to retry blocks if lockd is coming down anyway.
   
   The main difference between this patch and earlier ones is that it
   changes lockd_down to again send SIGKILL to lockd when it's coming
   down. svc_recv() uses schedule_timeout, so we can end up blocking
   there for a long time if we end up calling into it after
   kthread_stop wakes up lockd. Sending a SIGKILL should help ensure
   that svc_recv returns quickly if this occurs.
   
   Because kthread_stop blocks until the kthread actually goes down,
   we have to send the signal before calling it. This means that
   there is a very small race window like this where lockd_down
   could block for a long time:
  
  Having looked again at the code, could you please remind me _why_ we
  need to signal the process?
  
  AFAICS, kthread_stop() should normally wake the process up if it is
  in the schedule_timeout() state in svc_recv() since it uses
  wake_up_process(). Shouldn't the only difference be that svc_recv()
  will return -EAGAIN instead of -EINTR?
  
  If so, why can't we just forgo the signal?
  
 
 There's no guarantee that kthread_stop() won't wake up lockd before
 schedule_timeout() gets called, but after the last check for
 kthread_should_stop().
 

Sorry, I hit send too quick...

I'm certainly open to alternatives to signaling, but having a pending
signal seems to be the best way to ensure that we don't end up blocking
in schedule_timeout() here.

As a side note, I've rolled up a patch to add a kthread_stop_sig()
variant that will use force_sig to wake up a kthread instead of just
waking it up. I've not tested it yet, but once I do and if we can get
it in then we should be able to close the race I'm talking about in
this patch description as well...

-- 
Jeff Layton [EMAIL PROTECTED]


Re: NFS performance (Currently 2.6.20)

2008-02-06 Thread Jesper Krogh

Gabriel Barazer wrote:

On 02/06/2008 4:59:39 PM +0100, Jesper Krogh [EMAIL PROTECTED] wrote:


I have a similar setup, and I'm very curious on how you can read an
iowait value from the clients: On my nodes (server 2.6.21.5/clients
2.6.23.14), the iowait counter is only incremented when dealing with
block devices, and since my nodes are diskless my iowait is near 0%.


Output in top is like this:
top - 16:51:01 up 119 days,  6:10,  1 user,  load average: 2.09, 2.00, 
1.41

Tasks:  74 total,   2 running,  72 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.2%us,  0.0%sy,  0.0%ni, 50.0%id, 49.8%wa,  0.0%hi,  0.0%si, 
0.0%st

Mem:   2060188k total,  2047488k used,12700k free, 2988k buffers
Swap:  4200988k total,42776k used,  4158212k free,  1985500k cached


You obviously have a block device on your nodes, so I suspect that 
something is reading/writing to it. Looking at how much memory is used, 
your system must be constantly swapping. This could explain why your 
iowait is so high (if your swap space is a block device or a file on a 
block device. You don't use swap over NFS, do you?)


No swap over NFS and no swapping at all.

A vmstat 1 output of the above situation looks like:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff   cache   si   so    bi    bo   in   cs us sy id wa
 0  2  42768  11580   1368 1987336    0    0     0     0  638  366  1  0 50 48
 0  2  42768  13088   1368 1985924    0    0     0     0  695  367  2  1 50 47
 0  2  42768  13028   1368 1986112    0    0     0     0  345  129  0  0 50 50
 1  1  42768  12720   1364 1986328    0    0     0     0 1043  710  6  1 50 42
 0  1  42768  12648   1364 1987308    0    0     0     0  636  374  2  4 50 44
 0  2  42768  11608   1364 1988436    0    0     0     0  696  382  1  0 51 49


You can also see in the top report that barely any swap is being used.

Jesper
--
Jesper


Re: (fwd) nfs hang on 2.6.24

2008-02-06 Thread J. Bruce Fields
On Thu, Feb 07, 2008 at 10:19:06AM +1300, Andrew Dixie wrote:
 
  Oh, right, I was confusing client and server reboot and assuming the
  client would forget the uniquifier on server reboot.  That's obviously
  wrong!  The client will forget its own uniquifier on client reboot, but
  that's alright since it's happy enough just to let that old state time
  out at that point.  So the only possible problem is suboptimal behavior
  when the client reboot time is less than the lease time.
 
 There is one client, a stable connection between client and server, and
 neither client or server are being rebooted.
 Are the "string in use by client" messages still expected?

Assuming the client creates and destroys clientid's on demand, as
they're needed for opens, and uses whatever user credential it has at
hand to do so, then I think a sequence of alternating opens and closes
as different users could produce this.

But no, it doesn't indicate any real problem on its own.

 Below is a program that attempts to open a file that is contained in a
 directory that has been deleted by another client.
 
 I'm not sure these are conditions that are normally occuring, it's just
 something I encountered trying to reproduce the hang.
 
 This reliably reproduces:
 Feb  7 09:55:01 devfile kernel: NFSD: preprocess_seqid_op: bad seqid
 (expected 20, got 22)

That's a bug though, either on the client or server side.

--b.

 
 And about 1 in 10 times it also reproduces:
 Feb  7 09:55:01 devfile kernel: NFSD: setclientid: string in use by
 client(clientid 47a627bd/044b)
 
 The server is 2.6.18-5 from debian.
 
 ---
 
 #include <string.h>
 #include <stdio.h>
 #include <unistd.h>
 #include <errno.h>
 #include <fcntl.h>
 #include <sys/stat.h>
 #include <stdlib.h>
 
 #define ASSERT(x) \
 	if (!(x)) { fprintf(stderr, "%s:%i:assert: " #x "\n", __FILE__, \
 		__LINE__); abort(); }
 
 #define testdir "/home/andrewd/testdir"
 #define testfile testdir "/fred"
 
 int main(int argc, char *argv[])
 {
 	int fd;
 	int rv;
 
 	rv = mkdir(testdir, 0777);
 	ASSERT(rv == 0 || errno == EEXIST);
 
 	fd = open(testfile, O_CREAT|O_WRONLY);
 	ASSERT(fd != -1);
 	rv = write(fd, "stuff\n", 6);
 	ASSERT(rv == 6);
 	close(fd);
 
 	rv = access(testfile, 0);
 	ASSERT(rv == 0);
 
 	// Remove directory via another client (nfsv3)
 	system("ssh devlin7 rm -r " testdir);
 
 	// Try to open file
 	fd = open(testfile, O_RDONLY);
 	printf("got fd:%i errno:%i\n", fd, errno);
 	// fd == -1, errno = ENOENT
 	// This is expected, error on nfs server is not.
 	return 0;
 }
 
 


Re: NFS EINVAL on open(... | O_TRUNC) on 2.6.23.9

2008-02-06 Thread Andrew Morton
On Wed, 06 Feb 2008 22:55:02 +0100
Gianluca Alberici [EMAIL PROTECTED] wrote:

 I finally got it. The problem and the solution have been known for 6 months, 
 but nobody cared... up to now those servers have not been maintained, and this 
 problem is not discussed anywhere else than at the following link.
 The bug (userspace server side, I would say at this point) is well 
 described by the author of an nfs-user-server patch which has not been 
 merged yet. The magic hint to find it on google was 'nfs server 
 utimensat' :-)
 http://marc.info/?l=linux-nfs&m=118724649406144&w=2

This is pretty significant.  We have on several occasions in recent years
tightened up the argument checking on long-standing system calls and it's
always a concern that this will break previously-working applications.

And now it has happened.

If we put buggy code into the kernel then we're largely stuck with it: we
need to be back-compatible with our bugs so we don't break things like
this.

 I have already prepared a working patch for cfsd based upon the one I've 
 listed. The nfs patch has of course been waiting for commit since August 
 2007. I'll submit it to the Debian cfsd maintainers, hoping to have more 
 luck than my predecessor.
 It doesn't seem to me that there was any kernel-related issue.
 
 Thanks a lot again, and sorry for all the noise I have made. I will try 
 to be more appropriate next time.

That wasn't noise - it was quite valuable.  Thanks for all the work you did
on this.


Given that our broken-by-unbreaking code has been out there in several
releases there isn't really any point in rebreaking it to fix this - the
offending applications need to be repaired so they'll work on 2.6.22 and
2.6.23 anyway.



Re: (fwd) nfs hang on 2.6.24

2008-02-06 Thread Andrew Dixie

 Oh, right, I was confusing client and server reboot and assuming the
 client would forget the uniquifier on server reboot.  That's obviously
 wrong!  The client will forget its own uniquifier on client reboot, but
 that's alright since it's happy enough just to let that old state time
 out at that point.  So the only possible problem is suboptimal behavior
 when the client reboot time is less than the lease time.

There is one client, a stable connection between client and server, and
neither client or server are being rebooted.
Are the "string in use by client" messages still expected?

Below is a program that attempts to open a file that is contained in a
directory that has been deleted by another client.

I'm not sure these are conditions that normally occur; it's just
something I encountered trying to reproduce the hang.

This reliably reproduces:
Feb  7 09:55:01 devfile kernel: NFSD: preprocess_seqid_op: bad seqid
(expected 20, got 22)

And about 1 in 10 times it also reproduces:
Feb  7 09:55:01 devfile kernel: NFSD: setclientid: string in use by
client(clientid 47a627bd/044b)

The server is 2.6.18-5 from debian.

---

#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <stdlib.h>

#define ASSERT(x) \
	if (!(x)) { fprintf(stderr, "%s:%i:assert: " #x "\n", __FILE__, \
		__LINE__); abort(); }

#define testdir "/home/andrewd/testdir"
#define testfile testdir "/fred"

int main(int argc, char *argv[])
{
	int fd;
	int rv;

	rv = mkdir(testdir, 0777);
	ASSERT(rv == 0 || errno == EEXIST);

	fd = open(testfile, O_CREAT|O_WRONLY);
	ASSERT(fd != -1);
	rv = write(fd, "stuff\n", 6);
	ASSERT(rv == 6);
	close(fd);

	rv = access(testfile, 0);
	ASSERT(rv == 0);

	// Remove directory via another client (nfsv3)
	system("ssh devlin7 rm -r " testdir);

	// Try to open file
	fd = open(testfile, O_RDONLY);
	printf("got fd:%i errno:%i\n", fd, errno);
	// fd == -1, errno = ENOENT
	// This is expected, error on nfs server is not.
	return 0;
}




Re: (fwd) nfs hang on 2.6.24

2008-02-06 Thread Andrew Dixie
   What is rpciod doing while the machine hangs?
   Does 'netstat -t' show an active tcp connection to the server?
   Does tcpdump show any traffic going on the wire?
   What server are you running against? From the error messages below, I
 see it is a Linux machine, but which kernel is it running?

Server is 2.6.18-5 from debian.

From /proc/mounts:

server1:/files /files nfs
rw,vers=3,rsize=8192,wsize=8192,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=10.64.2.90
0 0
devfile:/srv/linshared_srv /srv nfs
rw,vers=3,rsize=32768,wsize=32768,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=10.64.2.21
0 0
devfile:/home /home nfs4
rw,vers=4,rsize=32768,wsize=32768,hard,intr,proto=tcp,timeo=600,retrans=3,sec=sys,addr=10.64.2.21
0 0

The nfs connections went into CLOSE_WAIT:
tcp0  0 10.64.2.25:888  10.64.2.21:2049
CLOSE_WAIT
tcp0  0 10.64.2.25:974  10.64.2.21:2049
CLOSE_WAIT

I can't see any traffic for it attempting to reconnect.

Below are the rpciod stacktraces from the previous hang.
Also rpc.idmap looks to be in the middle of something.

Cheers,
Andrew

rpciod/0  S f76f9e7c 0  2663  2
   f7d7c1f0 0046 0002 f76f9e7c f76f9e74  0286
f669bc00
   f7d7c358 c180a940  015b37db f669bc00 dfbc8c80 00ff

     f76f9ebc  f76f9ec4 c180284c f8c62e85
c02bc97f
Call Trace:
 [f8c62e85] rpc_wait_bit_interruptible+0x1a/0x1f [sunrpc]
 [c02bc97f] __wait_on_bit+0x33/0x58
 [f8c62e6b] rpc_wait_bit_interruptible+0x0/0x1f [sunrpc]
 [f8c62e6b] rpc_wait_bit_interruptible+0x0/0x1f [sunrpc]
 [c02bca07] out_of_line_wait_on_bit+0x63/0x6b
 [c013545e] wake_bit_function+0x0/0x3c
 [f8c62e19] __rpc_wait_for_completion_task+0x32/0x39 [sunrpc]
 [f8ce1352] nfs4_wait_for_completion_rpc_task+0x1b/0x2f [nfs]
 [f8ce2336] nfs4_proc_delegreturn+0x116/0x172 [nfs]
 [f8c63411] rpc_async_schedule+0x0/0xa [sunrpc]
 [f8ced370] nfs_do_return_delegation+0xf/0x1d [nfs]
 [f8cd135f] nfs_dentry_iput+0xd/0x49 [nfs]
 [c01865d2] dentry_iput+0x74/0x93
 [c018666d] d_kill+0x2d/0x46
 [c0186970] dput+0xd5/0xdc
 [f8ce4016] nfs4_free_closedata+0x26/0x41 [nfs]
 [f8c62c8d] rpc_release_calldata+0x16/0x20 [sunrpc]
 [c013220d] run_workqueue+0x7d/0x109
 [c0132a83] worker_thread+0x0/0xc5
 [c0132b3d] worker_thread+0xba/0xc5
 [c0135429] autoremove_wake_function+0x0/0x35
 [c0135362] kthread+0x38/0x5e
 [c013532a] kthread+0x0/0x5e
 [c0104b0f] kernel_thread_helper+0x7/0x10

rpciod/1-3 identical:
   df848710 0046 0002 f76fbfa0 f76fbf98  f8c633fd
0572
   df848878 c1812940 0001 015b36d3 df9abc08 f8c63411 00ff

     f776a840 c0132a83 f76fbfd0  c0132b0b

Call Trace:
 [f8c633fd] __rpc_execute+0x21d/0x231 [sunrpc]
 [f8c63411] rpc_async_schedule+0x0/0xa [sunrpc]
 [c0132a83] worker_thread+0x0/0xc5
 [c0132b0b] worker_thread+0x88/0xc5
 [c0135429] autoremove_wake_function+0x0/0x35
 [c0135362] kthread+0x38/0x5e
 [c013532a] kthread+0x0/0x5e
 [c0104b0f] kernel_thread_helper+0x7/0x10
 ===
rpc.idmapdS f777ff10 0  2687  1
   f7cea610 0086 0002 f777ff10 f777ff08  

   f7cea778 c1822940 0003 015d5741   00ff

     7fff f75e2b00 080536e8 0286 c02bc7f1

Call Trace:
 [c01355e8] add_wait_queue+0x12/0x32
 [c017d287] pipe_poll+0x24/0x7d
 [c0183476] do_select+0x365/0x3bc
 [c0183a60] __pollwait+0x0/0xac
 [c011f44f] default_wake_function+0x0/0x8
message repeated 10 times
 [c0259bb5] skb_release_all+0xa3/0xfa
 [c025e590] dev_hard_start_xmit+0x20c/0x277
 [c026d227] __qdisc_run+0x9e/0x164
 [c02564e7] sk_reset_timer+0xc/0x16
 [c0260758] dev_queue_xmit+0x288/0x2b0
 [c026b72e] eth_header+0x0/0xb6
 [c0264fe5] neigh_resolve_output+0x203/0x235
 [c027dd59] ip_finish_output+0x0/0x208
 [c027df29] ip_finish_output+0x1d0/0x208
 [c027edd1] ip_output+0x7d/0x92
 [c01e240c] number+0x147/0x215
 [c0183750] core_sys_select+0x283/0x2a0
 [c01e2d23] vsnprintf+0x440/0x47c
 [c0187123] d_lookup+0x1b/0x3b
 [c01a5fe3] proc_flush_task+0x12b/0x235
 [c0135a53] posix_cpu_timers_exit_group+0x4a/0x50
 [c0108472] convert_fxsr_from_user+0x15/0xd5
 [c0183be2] sys_select+0xd6/0x187
 [c018a6ce] mntput_no_expire+0x11/0x66
 [c0176b05] filp_close+0x51/0x58
 [c012743f] sys_wait4+0x31/0x34
 [c0103e5e] sysenter_past_esp+0x6b/0xa1




Re: (fwd) nfs hang on 2.6.24

2008-02-06 Thread Trond Myklebust

On Thu, 2008-02-07 at 11:40 +1300, Andrew Dixie wrote:
What is rpciod doing while the machine hangs?
Does 'netstat -t' show an active tcp connection to the server?
Does tcpdump show any traffic going on the wire?
What server are you running against? From the error messages below, I
  see it is a Linux machine, but which kernel is it running?
 
 Server is 2.6.18-5 from debian.
 
 From /proc/mounts:
 
 server1:/files /files nfs
 rw,vers=3,rsize=8192,wsize=8192,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=10.64.2.90
 0 0
 devfile:/srv/linshared_srv /srv nfs
 rw,vers=3,rsize=32768,wsize=32768,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=10.64.2.21
 0 0
 devfile:/home /home nfs4
 rw,vers=4,rsize=32768,wsize=32768,hard,intr,proto=tcp,timeo=600,retrans=3,sec=sys,addr=10.64.2.21
 0 0
 
 The nfs connections went into CLOSE_WAIT:
 tcp        0      0 10.64.2.25:888          10.64.2.21:2049         CLOSE_WAIT
 tcp        0      0 10.64.2.25:974          10.64.2.21:2049         CLOSE_WAIT
 
 I can't see any traffic for it attempting to reconnect.
 
 Below are the rpciod stacktraces from the previous hang.
 Also rpc.idmap looks to be in the middle of something.
 
 Cheers,
 Andrew
 
 rpciod/0  S f76f9e7c 0  2663  2
f7d7c1f0 0046 0002 f76f9e7c f76f9e74  0286
 f669bc00
f7d7c358 c180a940  015b37db f669bc00 dfbc8c80 00ff
 
  f76f9ebc  f76f9ec4 c180284c f8c62e85
 c02bc97f
 Call Trace:
  [f8c62e85] rpc_wait_bit_interruptible+0x1a/0x1f [sunrpc]
  [c02bc97f] __wait_on_bit+0x33/0x58
  [f8c62e6b] rpc_wait_bit_interruptible+0x0/0x1f [sunrpc]
  [f8c62e6b] rpc_wait_bit_interruptible+0x0/0x1f [sunrpc]
  [c02bca07] out_of_line_wait_on_bit+0x63/0x6b
  [c013545e] wake_bit_function+0x0/0x3c
  [f8c62e19] __rpc_wait_for_completion_task+0x32/0x39 [sunrpc]
  [f8ce1352] nfs4_wait_for_completion_rpc_task+0x1b/0x2f [nfs]
  [f8ce2336] nfs4_proc_delegreturn+0x116/0x172 [nfs]
  [f8c63411] rpc_async_schedule+0x0/0xa [sunrpc]
  [f8ced370] nfs_do_return_delegation+0xf/0x1d [nfs]
  [f8cd135f] nfs_dentry_iput+0xd/0x49 [nfs]
  [c01865d2] dentry_iput+0x74/0x93
  [c018666d] d_kill+0x2d/0x46
  [c0186970] dput+0xd5/0xdc
  [f8ce4016] nfs4_free_closedata+0x26/0x41 [nfs]
  [f8c62c8d] rpc_release_calldata+0x16/0x20 [sunrpc]
  [c013220d] run_workqueue+0x7d/0x109
  [c0132a83] worker_thread+0x0/0xc5
  [c0132b3d] worker_thread+0xba/0xc5
  [c0135429] autoremove_wake_function+0x0/0x35
  [c0135362] kthread+0x38/0x5e
  [c013532a] kthread+0x0/0x5e
  [c0104b0f] kernel_thread_helper+0x7/0x10

That's the bug right there. rpciod should never be calling a
synchronous RPC call.

I've already got a fix for this bug against 2.6.24. Could you see if it
applies to your kernel too?

Cheers
  Trond
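
For readers who want to see why a synchronous call from inside rpciod is fatal,
here is a minimal userspace sketch, assuming nothing about the real
rpciod/workqueue code (pthreads, hypothetical names): the lone worker queues a
second item and then waits for it, but nobody else can ever run that item. The
sketch times out and reports the would-be deadlock instead of hanging forever.

#include <pthread.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int queued_done;         /* "completion" of the queued item  */
static int have_queued_item;    /* is a second item waiting to run? */

/* First work item: queues a second item and waits for it to complete. */
static void first_item(void)
{
	struct timespec deadline;

	pthread_mutex_lock(&lock);
	have_queued_item = 1;                 /* "queue" the second item   */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += 2;                 /* give up after two seconds */
	while (!queued_done) {
		if (pthread_cond_timedwait(&cond, &lock, &deadline) != 0) {
			printf("deadlock: the only worker is blocked here, so the\n"
			       "queued item can never run or signal completion\n");
			break;
		}
	}
	pthread_mutex_unlock(&lock);
}

/* The single worker: runs one item at a time, in order. */
static void *worker(void *arg)
{
	(void)arg;
	first_item();     /* blocks waiting on work only this thread could run */
	pthread_mutex_lock(&lock);
	if (have_queued_item) {               /* second item finally runs...   */
		queued_done = 1;              /* ...but only after the wait    */
		pthread_cond_signal(&cond);
	}
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, worker, NULL);
	pthread_join(t, NULL);
	return 0;
}
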
---BeginMessage---
Otherwise, there is a potential deadlock if the last dput() from an NFSv4
close() or other asynchronous operation leads to nfs_clear_inode calling
the synchronous delegreturn.

Signed-off-by: Trond Myklebust [EMAIL PROTECTED]
---

 fs/nfs/delegation.c |   29 +
 fs/nfs/delegation.h |3 ++-
 fs/nfs/dir.c|1 -
 fs/nfs/inode.c  |2 +-
 fs/nfs/nfs4proc.c   |   22 +-
 5 files changed, 41 insertions(+), 16 deletions(-)

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index b03dcd8..2dead8d 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -174,11 +174,11 @@ int nfs_inode_set_delegation(struct inode *inode, struct rpc_cred *cred, struct
return status;
 }
 
-static int nfs_do_return_delegation(struct inode *inode, struct nfs_delegation *delegation)
+static int nfs_do_return_delegation(struct inode *inode, struct nfs_delegation *delegation, int issync)
 {
 	int res = 0;
 
-	res = nfs4_proc_delegreturn(inode, delegation->cred, &delegation->stateid);
+	res = nfs4_proc_delegreturn(inode, delegation->cred, &delegation->stateid, issync);
 	nfs_free_delegation(delegation);
 	return res;
 }
@@ -208,7 +208,7 @@ static int __nfs_inode_return_delegation(struct inode *inode, struct nfs_delegat
 	up_read(&clp->cl_sem);
 	nfs_msync_inode(inode);
 
-	return nfs_do_return_delegation(inode, delegation);
+	return nfs_do_return_delegation(inode, delegation, 1);
 }
 
 static struct nfs_delegation *nfs_detach_delegation_locked(struct nfs_inode 
*nfsi, const nfs4_stateid *stateid)
@@ -228,6 +228,27 @@ nomatch:
return NULL;
 }
 
+/*
+ * This function returns the delegation without reclaiming opens
+ * or protecting against delegation reclaims.
+ * It is therefore really only safe to be called from
+ * nfs4_clear_inode()
+ */
+void nfs_inode_return_delegation_noreclaim(struct inode *inode)
+{
+	struct nfs_client *clp = NFS_SERVER(inode)->nfs_client;
+   struct nfs_inode *nfsi = NFS_I(inode);
+   struct nfs_delegation *delegation;
+
+	if (rcu_dereference(nfsi->delegation) != NULL) {
+  

Re: [PATCH 2/2] NLM: Convert lockd to use kthreads

2008-02-06 Thread Trond Myklebust

On Wed, 2008-02-06 at 14:09 -0500, Jeff Layton wrote:
 On Wed, 06 Feb 2008 13:52:34 -0500
 Trond Myklebust [EMAIL PROTECTED] wrote:
 
  
  On Wed, 2008-02-06 at 13:47 -0500, Jeff Layton wrote:
   There's no guarantee that kthread_stop() won't wake up lockd before
   schedule_timeout() gets called, but after the last check for
   kthread_should_stop().
  
  Doesn't the BKL pretty much eliminate this race? (assuming you
  transform that call to 'if (!kthread_should_stop())
  schedule_timeout();')
  
  Trond
  
 
 I don't think so. That would require that lockd_down is always called
 with the BKL held, and I don't think it is, is it?

Nothing stops you from grabbing the BKL inside lockd_down, though :-)



[NFS] [patch 59/73] knfsd: Allow NFSv2/3 WRITE calls to succeed when krb5i etc is used.

2008-02-06 Thread Greg KH

2.6.23-stable review patch.  If anyone has any objections, please let us know.
--
From: NeilBrown [EMAIL PROTECTED]

patch ba67a39efde8312e386c6f603054f8945433d91f in mainline.

When RPCSEC/GSS and krb5i is used, requests are padded, typically to a multiple
of 8 bytes.  This can make the request look slightly longer than it
really is.

As of

f34b95689d2ce001c "The NFSv2/NFSv3 server does not handle zero
length WRITE request correctly",

the xdr decode routines for NFSv2 and NFSv3 reject requests that aren't
the right length, so krb5i (for example) WRITE requests can get lost.

This patch relaxes the appropriate test and enhances the related comment.

Signed-off-by: Neil Brown [EMAIL PROTECTED]
Signed-off-by: J. Bruce Fields [EMAIL PROTECTED]
Cc: Peter Staubach [EMAIL PROTECTED]
Signed-off-by: Linus Torvalds [EMAIL PROTECTED]
Signed-off-by: Greg Kroah-Hartman [EMAIL PROTECTED]

---
 fs/nfsd/nfs3xdr.c |5 -
 fs/nfsd/nfsxdr.c  |5 -
 2 files changed, 8 insertions(+), 2 deletions(-)

--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -396,8 +396,11 @@ nfs3svc_decode_writeargs(struct svc_rqst
 * Round the length of the data which was specified up to
 * the next multiple of XDR units and then compare that
 * against the length which was actually received.
+* Note that when RPCSEC/GSS (for example) is used, the
+* data buffer can be padded so dlen might be larger
+* than required.  It must never be smaller.
 */
-	if (dlen != XDR_QUADLEN(len)*4)
+	if (dlen < XDR_QUADLEN(len)*4)
 		return 0;
 
 	if (args->count > max_blocksize) {
--- a/fs/nfsd/nfsxdr.c
+++ b/fs/nfsd/nfsxdr.c
@@ -313,8 +313,11 @@ nfssvc_decode_writeargs(struct svc_rqst 
 * Round the length of the data which was specified up to
 * the next multiple of XDR units and then compare that
 * against the length which was actually received.
+* Note that when RPCSEC/GSS (for example) is used, the
+* data buffer can be padded so dlen might be larger
+* than required.  It must never be smaller.
 */
-	if (dlen != XDR_QUADLEN(len)*4)
+	if (dlen < XDR_QUADLEN(len)*4)
 		return 0;
 
 	rqstp->rq_vec[0].iov_base = (void*)p;
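
To see why the relaxed test accepts a krb5i-padded WRITE that the old test
rejected, here is a standalone arithmetic demo (not the nfsd code;
XDR_QUADLEN reproduced with its usual rounding):

#include <stdio.h>

#define XDR_QUADLEN(l)	(((l) + 3) >> 2)   /* round up to 4-byte XDR units */

static void check(unsigned int len, unsigned int dlen)
{
	unsigned int want = XDR_QUADLEN(len) * 4;

	printf("len=%u dlen=%u rounded=%u  old(!=): %s  new(<): %s\n",
	       len, dlen, want,
	       (dlen != want) ? "reject" : "accept",
	       (dlen <  want) ? "reject" : "accept");
}

int main(void)
{
	check(4096, 4096);  /* exact length: both tests accept              */
	check(4096, 4104);  /* GSS-style padding: only the new test accepts */
	check(4096, 4092);  /* short buffer: both tests correctly reject    */
	return 0;
}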



Re: NFS+krb5: Failed to create krb5 context for user with uid 0

2008-02-06 Thread Luke Cyca

On Feb 5, 2008, at 9:12 PM, Kevin Coffman wrote:

If the Mac server code can support other encryption types like Triple
DES and ArcFour, you shouldn't need to limit it to only the
des-cbc-crc key.  The Linux nfs-utils code on the client should be
limiting the negotiated encryption type to des.

I would assume if normal users are able to get a context and talk to
the server, that root using the keytab should be able to do so as
well.



I added a principal for root/[EMAIL PROTECTED] and added  
it to the client's keytab and everything appears to work now.


I then put the other keys back on the server's keytab as you suggested.

Thanks for the help!


Luke






RE: kernel exports table flushes out on running exportfs -a over mips

2008-02-06 Thread Anirban Sinha
Hi:

I did some extensive digging into the codebase and I believe I have the
reason
why exportfs -a flushes out the caches after NFS clients have mounted
the NFS filesystem. 
The analysis is complicated, but here's
the crux of the matter: 

There is a difference between /etc/exports and the kernel-maintained
cache. The
difference is that in /etc/exports, we use anonymous clients (*) whereas
the kernel
maintains FQDN client names in its exports cache (see attached file).
This
difference (the parsing code client_gettype() specifically checks for a
* or an IP or hostname among other things and based on that creates two
different types of caches) is causing the nfs codebase to recreate new
in core exports entries (the second time when we issue exportfs -a)
after parsing /proc/fs/nfs/export. Immediately later, it then throws
these away (for these
newly created entries, m_mayexport = 0 and m_exported = 1 in function
xtab_read()). For details, see the logic in exports_update_one():

if (exp->m_exported && !exp->m_mayexport) { ... unexporting ... }


Since both the anonymous and FQDN entries are essentially the same, this
results in blowing away the existing kernel exports table. 

My question is, is there an elegant solution to this problem without
simply using FQDNs in /etc/exports? I have confirmed that the problem
does not occur when both the in-kernel and /etc/exports tables have the same
entries (both * or both FQDN).

Cheers,

Ani


 -Original Message-
 From: [EMAIL PROTECTED] [mailto:linux-nfs-
 [EMAIL PROTECTED] On Behalf Of Anirban Sinha
 Sent: Thursday, January 31, 2008 2:09 PM
 To: Greg Banks
 Cc: linux-nfs@vger.kernel.org
 Subject: RE: kernel exports table flushes out on running exportfs -a
 over mips
 
 Hi Greg:
 
 Thanks for replying. Here goes my response:
 
  -Original Message-
  From: Greg Banks [mailto:[EMAIL PROTECTED]
  Sent: Wednesday, January 30, 2008 6:37 PM
  To: Anirban Sinha
  Cc: linux-nfs@vger.kernel.org
  Subject: Re: kernel exports table flushes out on running exportfs -a
  over mips
 
  On Wed, Jan 30, 2008 at 05:34:13PM -0800, Anirban Sinha wrote:
   Hi:
  
   I am seeing an unusual problem on running nfs server on mips. Over
  Intel
   this does not happen. When I run exportfs -a on the server when
the
   clients have already mounted their nfs filesystem, the kernel
 exports
   table as can be seen from /proc/fs/nfs/exports gets completely
  flushed
   out. We (me and one another colleague) have done some digging
 (mostly
   looking into nfsutils codebase) and it looks like a kernel side
  issue.
   We had also asked folks in the linux-mips mailing list, but
  apparently
   no one has any clue. I am just hoping that those who are more
  familiar
  with the user level and kernel side of nfs might give me something more
 to
   chew on. If you can give any suggestions that will be really
 useful.
  If
   you think the information I provided is not enough, I can give you
  any
   other information you need in this regard.
 
  Does the MIPS box have the /proc/fs/nfsd/ filesystem mounted?
 
 Ahh, I see what you mean. Yes, it is mounted, both /proc/fs/nfsd and
 /proc/fs/nfs. However, I can see from the code that check_new_cache()
  checks for a file "filehandle" which does not exist in that location.
 To be dead sure, I instrumented the code to insert a perror and it
 returns no such file or directory. The new_cache flag remains 0. Is
 this some sort of kernel bug?
 
 
  Perhaps you could try
 
  1) running exportfs under strace.  I suggest
 strace -o /tmp/s.log -s 1024 exportfs ...
 
 Strace does not work in our environment as it has not been properly
 ported to mips.
 
  2) AND enabling kernel debug messages
 rpcdebug -m nfsd -s export
 rpcdebug -m rpc -s cache
 
 I attach the dmesg output after enabling those flags. Zeugma-x-y are
 the clients to this server. Not sure if it means anything suspicious.
 
 Ani
 
 
 
 
  --
  Greg Banks, R&D Software Engineer, SGI Australian Software Group.
  The cake is *not* a lie.
  I don't speak for SGI.
root:[EMAIL PROTECTED]:~# cat /proc/fs/nfs/exports
# Version 1.1
# Path Client(Flags) # IPs
/cf4        zeugma-1-3(rw,insecure,no_root_squash,async,wdelay,no_subtree_check,insecure_locks)
/cf2        zeugma-1-4(rw,insecure,no_root_squash,async,wdelay,no_subtree_check,insecure_locks)
/cf2/cpu3   zeugma-1-3(rw,insecure,no_root_squash,async,wdelay,no_subtree_check,insecure_locks)
/cf2/cpu4   zeugma-1-4(rw,insecure,no_root_squash,async,wdelay,no_subtree_check,insecure_locks)
/cf2/cpu2   zeugma-1-2(rw,insecure,no_root_squash,async,wdelay,no_subtree_check,insecure_locks)
/cf2        zeugma-1-2(rw,insecure,no_root_squash,async,wdelay,no_subtree_check,insecure_locks)
/cf4        zeugma-1-2(rw,insecure,no_root_squash,async,wdelay,no_subtree_check,insecure_locks)
/cf4        zeugma-1-4(rw,insecure,no_root_squash,async,wdelay,no_subtree_check,insecure_locks)
/cf2        zeugma-1-3(rw,insecure,no_root_squash,async,wdelay,no_subtree_check,insecure_locks)


Wondering about NLM_HOST_MAX ... doesn't anyone understand this code?

2008-02-06 Thread Neil Brown

Hi,
 I've been looking at NLM_HOST_MAX in fs/lockd/host.c, as we have a
 patch in SLES that makes it configurable, and the patch needs to
 either go upstream or out the window...

 But the code that uses NLM_HOST_MAX is weird!  Look:

#define NLM_HOST_EXPIRE		((nrhosts > NLM_HOST_MAX)? 300 * HZ : 120 * HZ)
#define NLM_HOST_COLLECT	((nrhosts > NLM_HOST_MAX)? 120 * HZ :  60 * HZ)

So if the number of hosts is more than the maximum (64), we *increase*
the expiry time and the garbage collection interval.
You would think they should be decreased when we have exceeded the
max, so we can get rid of more old entries more quickly.  But no, they
are increased.
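
A quick standalone demo of the inversion (HZ folded out, values in seconds;
not the lockd code):

#include <stdio.h>

#define NLM_HOST_MAX 64

static void show(int nrhosts)
{
	int expire  = (nrhosts > NLM_HOST_MAX) ? 300 : 120;  /* old NLM_HOST_EXPIRE  */
	int collect = (nrhosts > NLM_HOST_MAX) ? 120 :  60;  /* old NLM_HOST_COLLECT */

	printf("nrhosts=%3d  expire=%3ds  gc interval=%3ds\n",
	       nrhosts, expire, collect);
}

int main(void)
{
	show(32);    /* under the limit: 120s expiry, GC every 60s            */
	show(200);   /* over the limit: 300s expiry, GC every 120s -- longer! */
	return 0;
}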

And in the code where we add a host to the list of hosts:

	if (++nrhosts > NLM_HOST_MAX)
		next_gc = 0;


So when we go over the limit, we garbage collect straight away, but
almost certainly do nothing because we've just given every host an
extra 3 minutes that it is allowed to live.

We could change the '>' to '<' which would make the code make sense at
least, but I don't think we want to.  A server could easily have more
than 64 clients doing lock requests, and we don't want to make life
harder for clients just because there are more of them.

I think we should just get rid of NLM_HOST_MAX altogether.  Old hosts
will still go away in a few minutes and pushing them out quickly
shouldn't be needed.

So: any comments on the above or on the patch below.  I've chosen to
go with discard hosts older than 5 minutes every 2 minutes rather
than discard hosts older than 2 minutes every minute even though the
latter is what would have been in effect most of the time, as it seems
more like what was intended.

Thanks,
NeilBrown


Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./fs/lockd/host.c |8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff .prev/fs/lockd/host.c ./fs/lockd/host.c
--- .prev/fs/lockd/host.c   2008-02-07 14:20:54.0 +1100
+++ ./fs/lockd/host.c   2008-02-07 14:23:38.0 +1100
@@ -19,12 +19,11 @@
 
 
 #define NLMDBG_FACILITY		NLMDBG_HOSTCACHE
-#define NLM_HOST_MAX		64
 #define NLM_HOST_NRHASH		32
 #define NLM_ADDRHASH(addr)	(ntohl(addr) & (NLM_HOST_NRHASH-1))
 #define NLM_HOST_REBIND		(60 * HZ)
-#define NLM_HOST_EXPIRE		((nrhosts > NLM_HOST_MAX)? 300 * HZ : 120 * HZ)
-#define NLM_HOST_COLLECT	((nrhosts > NLM_HOST_MAX)? 120 * HZ :  60 * HZ)
+#define NLM_HOST_EXPIRE		(300 * HZ)
+#define NLM_HOST_COLLECT	(120 * HZ)
 
 static struct hlist_head   nlm_hosts[NLM_HOST_NRHASH];
 static unsigned long   next_gc;
@@ -142,9 +141,6 @@ nlm_lookup_host(int server, const struct
 	INIT_LIST_HEAD(&host->h_granted);
 	INIT_LIST_HEAD(&host->h_reclaim);
 
-	if (++nrhosts > NLM_HOST_MAX)
-		next_gc = 0;
-
 out:
 	mutex_unlock(&nlm_host_mutex);
 	return host;


RE: kernel exports table flushes out on running exportfs -a over mips

2008-02-06 Thread Anirban Sinha
At a higher level, in general, I think the kernel exports table need not
match /etc/exports at all. When we run exportfs -a again, what the
codebase intends to do is the following:

1. Scan /etc/exports and verify that an entry exists (create one if not)
in its in core exports table. Mark each of these as may_be_exported.

2. Scan /proc and see that each of the entries there has a corresponding
entry in the in core exports table (a matching operation). If not,
create a new entry. Mark all entries from /proc as exported.

3. If there are any entries that are *not* may_be_exported and yet
exported, then issue the right rpc through /proc/net/sunrpc/app
cache to delete that entry from the kernel table.


In this case, the matching operation does not detect that a * in the
hostname essentially means *anyone* can mount the volume, regardless of
their specific names. As a result, duplicate entries are created and
ultimately everything gets flushed out :(
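
A simplified, hypothetical model of the failure described above (illustrative
names only; the real logic lives in nfs-utils around exports_update_one() and
client_gettype()):

#include <stdio.h>
#include <string.h>

struct export_entry {
	const char *path;
	const char *client;     /* "*", a dotted quad, or an FQDN      */
	int m_mayexport;        /* came from /etc/exports              */
	int m_exported;         /* read back from /proc/fs/nfs/exports */
};

/* Literal comparison, as in the failing case: "*" never matches an FQDN. */
static int match(const struct export_entry *a, const struct export_entry *b)
{
	return !strcmp(a->path, b->path) && !strcmp(a->client, b->client);
}

int main(void)
{
	struct export_entry etc  = { "/cf2", "*",          1, 0 };  /* step 1 */
	struct export_entry proc = { "/cf2", "zeugma-1-3", 0, 1 };  /* step 2 */

	/* Step 2 should merge the flags into one entry, but the two records
	 * never match, so the kernel-side entry keeps m_mayexport == 0.   */
	if (match(&etc, &proc))
		proc.m_mayexport = etc.m_mayexport;

	/* Step 3: exported but apparently no longer allowed => unexported. */
	if (proc.m_exported && !proc.m_mayexport)
		printf("unexporting %s for %s -- table gets flushed\n",
		       proc.path, proc.client);
	return 0;
}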

Any elegant suggestion/bugfix will be really appreciated.


Ani


 
 -Original Message-
 From: Anirban Sinha
 Sent: Wednesday, February 06, 2008 6:20 PM
 To: Anirban Sinha; Greg Banks
 Cc: linux-nfs@vger.kernel.org
 Subject: RE: kernel exports table flushes out on running exportfs -a
 over mips
 
 Hi:
 
 I did some extensive digging into the codebase and I believe I have
the
 reason why exportfs -a flushes out the caches after NFS clients have
 mounted the NFS filesystem.
 The analysis is complicated, but here's
 the crux of the matter:
 
 There is a difference in the /etc/exports and the kernel maintained
 cache. The difference is that in /etc/exports, we use anonymous
clients
 (*) whereas the kernel maintains FQDN client names in its exports cache
 (see attached file). This difference (the parsing code
client_gettype()
 specifically checks for a * or an IP or hostname among other things
and
 based on that creates two different types of caches) is causing the
nfs
 codebase to recreate new in core exports entries (the second time when
 we issue exportfs -a) after parsing /proc/fs/nfs/export. Immediately
 later, it then throws these away (for these newly created entries,
 m_mayexport = 0 and m_exported = 1 in function xtab_read()). For
 details, see the logic in exports_update_one():
 
 if (exp->m_exported && !exp->m_mayexport) { ... unexporting ... }
 
 
 Since both the anonymous and FQDN entries are essentially the same,
 this results in blowing away the existing kernel exports table.
 
 My question is, is there an elegant solution to this problem without
 simply using FQDNs in /etc/exports? I have confirmed that the problem
 does not occur when both the in-kernel and /etc/exports tables have
 the same entries (both * or both FQDN).
 
 Cheers,
 
 Ani
 
 
  -Original Message-
  From: [EMAIL PROTECTED] [mailto:linux-nfs-
  [EMAIL PROTECTED] On Behalf Of Anirban Sinha
  Sent: Thursday, January 31, 2008 2:09 PM
  To: Greg Banks
  Cc: linux-nfs@vger.kernel.org
  Subject: RE: kernel exports table flushes out on running exportfs -a
  over mips
 
  Hi Greg:
 
  Thanks for replying. Here goes my response:
 
   -Original Message-
   From: Greg Banks [mailto:[EMAIL PROTECTED]
   Sent: Wednesday, January 30, 2008 6:37 PM
   To: Anirban Sinha
   Cc: linux-nfs@vger.kernel.org
   Subject: Re: kernel exports table flushes out on running exportfs
-
 a
   over mips
  
   On Wed, Jan 30, 2008 at 05:34:13PM -0800, Anirban Sinha wrote:
Hi:
   
I am seeing an unusual problem on running nfs server on mips.
 Over
   Intel
this does not happen. When I run exportfs -a on the server when
the clients have already mounted their nfs filesystem, the
kernel
  exports
table as can be seen from /proc/fs/nfs/exports gets completely
   flushed
out. We (me and one another colleague) have done some digging
  (mostly
looking into nfsutils codebase) and it looks like a kernel side
   issue.
We had also asked folks in the linux-mips mailing list, but
   apparently
no one has any clue. I am just hoping that those who are more
   familiar
with the user level and kernel side of nfs might give me something
 more
  to
chew on. If you can give any suggestions that will be really
  useful.
   If
you think the information I provided is not enough, I can give
 you
   any
other information you need in this regard.
  
   Does the MIPS box have the /proc/fs/nfsd/ filesystem mounted?
 
  Ahh, I see what you mean. Yes, it is mounted, both /proc/fs/nfsd and
  /proc/fs/nfs. However, I can see from the code that
check_new_cache()
  checks for a file "filehandle" which does not exist in that
location.
  To be dead sure, I instrumented the code to insert a perror and it
  returns no such file or directory. The new_cache flag remains 0.
Is
  this some sort of kernel bug?
 
 
   Perhaps you could try
  
   1) running exportfs under strace.  I suggest
  strace -o /tmp/s.log -s 1024 exportfs ...
 
  Strace does not work in our environment as it has not been 

Re: [NFS] [PATCH] Make UDF exportable

2008-02-06 Thread Neil Brown
On Wednesday February 6, [EMAIL PROTECTED] wrote:
   + dotdot.d_name.name = "..";
   + dotdot.d_name.len = 2;
   +
   + lock_kernel();
   + if (!udf_find_entry(child->d_inode, &dotdot, &fibh, &cfi))
   + goto out_unlock;
Have you ever tried this? I think this could never work. UDF doesn't have
  an entry named ".." in a directory. You have to search for an entry that has
  the FID_FILE_CHAR_PARENT bit set in fileCharacteristics. Maybe you could
  hack around udf_find_entry() to recognize the ".." dentry and do the search
  accordingly.
 Probably not. I just tested that I could read files and navigate the
 directory structure. However looking into UDF I think you are right - it
 will fail.
 I have extended udf_find_entry() to do an explicit check based on
 fileCharacteristics as you propose.
 How do I actually test this case?

 - Mount the filesystem from the server.
 - 'cd' a few directories down into the filesystem.
 - reboot the server(1)
 - on the client 'ls -l'.

(1) A full reboot isn't needed.  Just unexport, unmount, remount,
re-export on the server.

Alternatively, use a non-Linux client and cd down into the filesystem
and
ls -l ..

NeilBrown
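
For illustration, a userspace model of the point Jan raises: a UDF directory
has no entry literally named "..", only an entry flagged as the parent, so a
".." lookup has to match on fileCharacteristics rather than on the name. The
names and the flag value below are illustrative, not the fs/udf code:

#include <stdio.h>
#include <string.h>

#define FID_FILE_CHAR_PARENT 0x08    /* "parent" bit in fileCharacteristics */

struct udf_dirent {
	const char   *name;          /* empty for the parent entry */
	unsigned char file_characteristics;
};

static const struct udf_dirent *
find_entry(const struct udf_dirent *dir, int n, const char *want)
{
	int want_parent = !strcmp(want, "..");
	int i;

	for (i = 0; i < n; i++) {
		if (want_parent) {
			if (dir[i].file_characteristics & FID_FILE_CHAR_PARENT)
				return &dir[i];  /* match the flag, not the name */
		} else if (!strcmp(dir[i].name, want)) {
			return &dir[i];
		}
	}
	return NULL;
}

int main(void)
{
	const struct udf_dirent dir[] = {
		{ "",      FID_FILE_CHAR_PARENT },  /* the parent entry */
		{ "file1", 0 },
	};

	printf("lookup \"..\": %s\n",
	       find_entry(dir, 2, "..") ? "found via parent flag" : "not found");
	return 0;
}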




RE: kernel exports table flushes out on running exportfs -a over mips

2008-02-06 Thread Anirban Sinha
Sorry, it does look like it indeed solved the problem. Clearly, I have
missed something in my analysis of the codebase. In any case, thanks a
lot. 

Good night,

Ani


 -Original Message-
 From: Neil Brown [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, February 06, 2008 9:22 PM
 To: Anirban Sinha
 Cc: Greg Banks; linux-nfs@vger.kernel.org
 Subject: RE: kernel exports table flushes out on running exportfs -a
 over mips
 
 On Thursday January 31, [EMAIL PROTECTED] wrote:
  
   Does the MIPS box have the /proc/fs/nfsd/ filesystem mounted?
 
  Ahh, I see what you mean. Yes, it is mounted, both /proc/fs/nfsd and
  /proc/fs/nfs. However, I can see from the code that
check_new_cache()
  checks for a file "filehandle" which does not exist in that
location.
 To
  be dead sure, I instrumented the code to insert a perror and it
 returns
  no such file or directory. The new_cache flag remains 0. Is this
 some
  sort of kernel bug?
 
 OK, that means that /proc/fs/nfs is *not* mounted.
 
 /proc is mounted, and it contains several directories including
  /proc/fs/nfs and /proc/fs/nfsd.
 
 To get modern NFS service, you need to
mount -t nfsd nfsd /proc/fs/nfsd
 
 before running any nfsd related programs (e.g. mountd, nfsd).
  Most distros do that in their startup scripts.  It seems you are
 missing this.
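
  For example, an /etc/fstab entry that arranges the same mount at boot time
  (the exact options column varies by distribution) might look like:

  nfsd    /proc/fs/nfsd    nfsd    defaults    0 0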
 
 However it should still work.  It seems that it doesn't.
 I tried without /proc/fs/nfsd mounted and got the same result as you.
 It seems that we broke things when /var/lib/nfs/rmtab was changed to
 store IP addresses rather than host names.
 
 The following patch to nfs-utils will fix it.   Or you can just mount
 the 'nfsd' filesystem as above.
 
 NeilBrown
 
 
 
 diff --git a/support/export/client.c b/support/export/client.c
 index 1cb242f..e96f5e0 100644
 --- a/support/export/client.c
 +++ b/support/export/client.c
 @@ -462,5 +462,5 @@ client_gettype(char *ident)
 	sp++; if(!isdigit(*sp) || strtoul(sp, &sp, 10) > 255 || *sp != '.') return MCL_FQDN;
 	sp++; if(!isdigit(*sp) || strtoul(sp, &sp, 10) > 255 || *sp != '\0') return MCL_FQDN;
 	/* we lie here a bit. but technically N.N.N.N == N.N.N.N/32 :) */
 -	return MCL_SUBNETWORK;
 +	return MCL_FQDN;
  }
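
For illustration, a rough, standalone re-creation of the classification in
question (hypothetical and heavily trimmed; the real function is
client_gettype() in nfs-utils support/export/client.c). The one-line fix above
makes a plain dotted quad classify as MCL_FQDN instead of MCL_SUBNETWORK, so it
can match the entries exportfs builds from the IP addresses now stored in
/var/lib/nfs/rmtab:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum mcl_type { MCL_ANONYMOUS, MCL_FQDN, MCL_SUBNETWORK };

static enum mcl_type client_gettype(const char *ident)
{
	const char *sp = ident;
	char *end;
	int i;

	if (!strcmp(ident, "*"))
		return MCL_ANONYMOUS;
	if (strchr(ident, '/'))
		return MCL_SUBNETWORK;           /* address/netmask form */

	for (i = 0; i < 4; i++) {                /* try to parse N.N.N.N */
		if (!isdigit((unsigned char)*sp) || strtoul(sp, &end, 10) > 255)
			return MCL_FQDN;         /* not a dotted quad: a hostname */
		sp = end;
		if (*sp != (i < 3 ? '.' : '\0'))
			return MCL_FQDN;
		sp++;
	}
	/* "we lie here a bit": a plain N.N.N.N now counts as MCL_FQDN so it
	 * matches the rmtab-derived entries; it used to be MCL_SUBNETWORK. */
	return MCL_FQDN;
}

int main(void)
{
	static const char *const names[] =
		{ "MCL_ANONYMOUS", "MCL_FQDN", "MCL_SUBNETWORK" };
	const char *tests[] = { "*", "10.64.2.25", "zeugma-1-3", "10.64.2.0/24" };
	unsigned int i;

	for (i = 0; i < sizeof(tests) / sizeof(tests[0]); i++)
		printf("%-14s -> %s\n", tests[i], names[client_gettype(tests[i])]);
	return 0;
}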