NFS performance (Currently 2.6.20)
Hi. I'm currently trying to optimize our NFS server. We're running in a cluster setup with a single NFS server and some compute nodes pulling data from it. Currently the dataset is less than 10GB, so it fits in memory on the NFS server (confirmed via vmstat 1). I'm getting around 500mbit (700 peak) off the server on a gigabit link, and the server is CPU-bottlenecked when this happens. Clients have iowait around 30-50%.

Is it reasonable to expect to be able to fill a gigabit link in this scenario? (I'd like to put in a 10Gbit interface, but not while I have a CPU bottleneck.)

Should I go for NFSv2 (the default if I don't change mount options), NFSv3, or NFSv4? The NFSv3 default mount options give around 1MB for rsize and wsize, but the nfs man page suggests setting them to around 32K.

I probably only need some pointers to the documentation. Thanks.

-- Jesper Krogh

- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
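[For reference, the transfer sizes in question are set as mount options. A hypothetical example follows — the server name, export path, and mount point are invented, and the client and server will negotiate the values down if either side supports less:]

```
# Hypothetical NFSv3 mount with explicit 32K transfer sizes
mount -t nfs -o vers=3,rsize=32768,wsize=32768 nfsserver:/data /mnt/data

# Check the rsize/wsize actually in effect after negotiation
grep /mnt/data /proc/mounts
```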
Re: [PATCH 2/2] NLM: Convert lockd to use kthreads
On Tue, 5 Feb 2008 23:35:48 -0500 Christoph Hellwig [EMAIL PROTECTED] wrote:
> On Tue, Feb 05, 2008 at 02:37:57PM -0500, Jeff Layton wrote:
> > Because kthread_stop blocks until the kthread actually goes down, we
> > have to send the signal before calling it. This means that there is a
> > very small race window like this where lockd_down could block for a
> > long time:
> >
> > lockd_down signals lockd
> > lockd invalidates locks
> > lockd flushes signals
> > lockd checks kthread_should_stop
> > lockd_down calls kthread_stop
> > lockd calls svc_recv
> >
> > ...and lockd blocks until recvmsg returns. I think this is a pretty
> > unlikely scenario though. We could probably ensure it doesn't happen
> > with some locking, but I'm not sure that it would be worth the trouble.
>
> This is not avoidable unless we take sending the signal into the
> kthread machinery.

Yes. Perhaps we should consider a kthread_stop_with_signal() function that does a kthread_stop and sends a signal before waiting for completion? Most users of kthread_stop won't need it, but it would be nice here. CIFS could probably also use something like that.

> You should probably add a comment similar to your patch description
> above the place where the signal is sent.

I'll do that and respin...

In the interest of full disclosure, we have some other options besides sending a signal here:

1) We could call svc_recv with a shorter timeout. This means that lockd will wake up more frequently, even when it has nothing to do.

2) We could try to ensure that when lockd_down is called, a msg (maybe a NULL procedure) is sent to lockd's socket to wake it up after kthread_stop is called. This would probably mean queueing up a task to a workqueue to do it.

...neither of these seems more palatable than sending a signal.

-- Jeff Layton [EMAIL PROTECTED]
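[A rough sketch of the proposed helper — purely hypothetical, untested, and the name is only what this thread suggests; no such function exists. Implemented outside kthread.c it can only mirror what lockd_down does by hand today, so the small race described above remains; closing it fully would require moving the signalling into the kthread machinery itself, as Christoph notes:]

```c
/*
 * HYPOTHETICAL sketch of kthread_stop_with_signal() as proposed above.
 * Signal first to kick the thread out of a signal-interruptible sleep
 * (like svc_recv), then block in kthread_stop(). Because the stop flag
 * is only set inside kthread_stop(), the thread can still flush the
 * signal, see kthread_should_stop() == false, and go back to sleep --
 * the same race window lockd_down has today.
 */
int kthread_stop_with_signal(struct task_struct *k, int signum)
{
	send_sig(signum, k, 1);		/* wake the thread out of its sleep */
	return kthread_stop(k);		/* set should_stop and wait for exit */
}
```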
Re: NFS performance (Currently 2.6.20)
Hi,

On 02/06/2008 11:04:34 AM +0100, Jesper Krogh [EMAIL PROTECTED] wrote:
> Hi. I'm currently trying to optimize our NFS server. We're running in a
> cluster setup with a single NFS server and some compute nodes pulling
> data from it. Currently the dataset is less than 10GB so it fits in
> memory on the NFS server (confirmed via vmstat 1). Currently I'm getting
> around 500mbit (700 peak) off the server on a gigabit link and the
> server is CPU-bottlenecked when this happens. Clients have iowait
> around 30-50%.

I have a similar setup, and I'm very curious how you can read an iowait value from the clients: on my nodes (server 2.6.21.5/clients 2.6.23.14), the iowait counter is only incremented when dealing with block devices, and since my nodes are diskless my iowait is near 0%. Maybe I'm wrong, but when the NFS server lags, it is my system counter that increases (with peaks at 30% system instead of 5-10%).

> Is it reasonable to expect to be able to fill a gigabit link in this
> scenario? (I'd like to put in a 10Gbit interface, but not while I have
> a CPU bottleneck.)

I'm sure this is possible, but it is very dependent on which kind of traffic you have. If you only have data to pull (which in theory never invalidates the page cache on the server), and you use options like 'noatime,nodiratime' to avoid NFS updating the access times, it seems possible to me. But maybe your CPU is busy doing something other than just handling NFS traffic. Maybe you should change your network controller? I use the Intel Gigabit ones (integrated ESB2 with the e1000 driver) with rx-polling and Intel I/OAT enabled (DMA engine), and this really helps by reducing interrupts when dealing with a lot of traffic. You will have to check whether your kernel has I/OAT enabled in the DMA engines section.

> Should I go for NFSv2 (the default if I don't change mount options),
> NFSv3, or NFSv4?

NFSv2/3 have nearly the same performance, and NFSv4 takes a slight performance hit, probably because of how new it is: it's too early to work on performance while the features are not completely stable.

> The NFSv3 default mount options give around 1MB for rsize and wsize,
> but the nfs man page suggests setting them to around 32K.

The values for the rsize and wsize mount options depend on the amount of memory you have (on the server, AFAIK), and when you have 4GB the values are not very realistic anymore. On my systems the defaults come out to rsize/wsize of 512KB and all is running fine, but I'm sure there is some work to be done to adjust the buffer sizes more precisely when dealing with large amounts of memory (e.g. a 1MB buffer makes no sense). The 32k value is a very old one, and the man page doesn't even explain the memory-related rsize/wsize values.

> I probably only need some pointers to the documentation.

And the documentation probably needs some refreshing, but things are changing nearly every week here...

Gabriel
Re: (fwd) nfs hang on 2.6.24
On Wed, 2008-02-06 at 19:24 +1300, Andrew Dixie wrote:
> > The fact that the delegreturn call appears to have hit xprt_timer is
> > interesting. Under normal circumstances, timeouts should never occur
> > under NFSv4. Could you tell us what mount options you're using here?
> > Also, could you please confirm for us that the server is still up and
> > responding to requests from other clients.
>
> The mount options were defaults, i.e. mount -t nfs4 server:/mnt /mnt.
> sshd has died. I will confirm exactly what is in /proc/mounts when I
> get physical access. The server is still up serving active nfsv3
> clients. I mounted nfsv4 on another client and that worked too. Thanks.

My other questions are: What is rpciod doing while the machine hangs? Does 'netstat -t' show an active tcp connection to the server? Does tcpdump show any traffic going on the wire? What server are you running against? From the error messages below, I see it is a Linux machine, but which kernel is it running?

> The following appears in the server logs:
>
> Feb 4 08:28:01 devfile kernel: NFSD: setclientid: string in use by client(clientid 47945499/1c88)
> Feb 4 08:34:18 devfile kernel: NFSD: setclientid: string in use by client(clientid 47945499/1c8d)
> Feb 4 08:38:02 devfile kernel: NFSD: setclientid: string in use by client(clientid 47945499/1c8f)
> Feb 4 10:01:02 devfile kernel: NFSD: setclientid: string in use by client(clientid 47a627bd/0002)
> Feb 4 10:07:37 devfile kernel: NFSD: setclientid: string in use by client(clientid 47a627bd/0005)
> Feb 4 10:17:02 devfile kernel: NFSD: setclientid: string in use by client(clientid 47a627bd/019e)
> Feb 5 07:59:58 devfile kernel: NFSD: setclientid: string in use by client(clientid 47a627bd/03f2)
> Feb 5 08:01:02 devfile kernel: NFSD: setclientid: string in use by client(clientid 47a627bd/03f3)
>
> These are not close to the times that it hung.

Yep. The above is entirely expected, and is not actually a bug. I keep asking Bruce to remove that warning...

> Prior to Feb 4 it occurs 10 to 50 times a day (from when the client was
> running a 2.6.18 kernel). There is only one nfsv4 client.

OK...

Thanks
Trond
Re: (fwd) nfs hang on 2.6.24
On Wed, 2008-02-06 at 10:07 -0500, J. Bruce Fields wrote:
> That went into 2.6.22:
>
> 21315edd4877b593d5bf.. [PATCH] knfsd: nfsd4: demote clientid in use printk to a dprintk
>
> It may suggest a problem if this is happening a lot, though, right?

The client should always be able to generate a new unique clientid if this happens.

Trond
Re: NFS performance (Currently 2.6.20)
On Wed, 2008-02-06 at 15:37 +0100, Gabriel Barazer wrote:
> > Should I go for NFSv2 (the default if I don't change mount options),
> > NFSv3, or NFSv4?
>
> NFSv2/3 have nearly the same performance

Only if you shoot yourself in the foot by setting the 'async' flag in /etc/exports. Don't do that...

Most people will want to use NFSv3 for performance reasons. Unlike NFSv2 with 'async', NFSv3 with the 'sync' export flag set actually does _safe_ server-side caching of writes.

Trond
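[To make the distinction concrete, here is a hypothetical /etc/exports entry — the path and client subnet are invented — showing the safe and unsafe variants being discussed:]

```
# /etc/exports -- hypothetical export for a compute cluster
#
# Safe: with 'sync', the server only claims data is stable once it
# really is; NFSv3 clients still perform well because they can send
# UNSTABLE writes and flush them later with COMMIT.
/data   192.168.0.0/24(rw,sync)

# Unsafe: with 'async', the server acknowledges writes before they
# reach disk, so a server crash can silently lose data.
#/data  192.168.0.0/24(rw,async)
```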
Re: NFS performance (Currently 2.6.20)
Hi,

> > Hi. I'm currently trying to optimize our NFS server. We're running in
> > a cluster setup with a single NFS server and some compute nodes
> > pulling data from it. Currently the dataset is less than 10GB so it
> > fits in memory on the NFS server (confirmed via vmstat 1). Currently
> > I'm getting around 500mbit (700 peak) off the server on a gigabit
> > link and the server is CPU-bottlenecked when this happens. Clients
> > have iowait around 30-50%.
>
> I have a similar setup, and I'm very curious how you can read an iowait
> value from the clients: on my nodes (server 2.6.21.5/clients
> 2.6.23.14), the iowait counter is only incremented when dealing with
> block devices, and since my nodes are diskless my iowait is near 0%.

Output in top is like this:

top - 16:51:01 up 119 days, 6:10, 1 user, load average: 2.09, 2.00, 1.41
Tasks: 74 total, 2 running, 72 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.2%us, 0.0%sy, 0.0%ni, 50.0%id, 49.8%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2060188k total, 2047488k used, 12700k free, 2988k buffers
Swap: 4200988k total, 42776k used, 4158212k free, 1985500k cached

> > Is it reasonable to expect to be able to fill a gigabit link in this
> > scenario? (I'd like to put in a 10Gbit interface, but not while I
> > have a CPU bottleneck.)
>
> I'm sure this is possible, but it is very dependent on which kind of
> traffic you have. If you only have data to pull (which in theory never
> invalidates the page cache on the server), and you use options like
> 'noatime,nodiratime' to avoid NFS updating the access times, it seems
> possible to me. But maybe your CPU is busy doing something other than
> just handling NFS traffic. Maybe you should change your network
> controller? I use the Intel Gigabit ones (integrated ESB2 with the
> e1000 driver) with rx-polling and Intel I/OAT enabled (DMA engine),
> and this really helps by reducing interrupts when dealing with a lot
> of traffic.

It is a Sun V20Z (dual Opteron). The NIC is:

02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)

Jesper

-- Jesper Krogh
[PATCH 1/4] NLM: set RPC_CLNT_CREATE_NOPING for NLM RPC clients
It's currently possible for an unresponsive NLM client to completely lock up a server's lockd. The scenario is something like this:

1) client1 (or a process on the server) takes a lock on a file
2) client2 tries to take a blocking lock on the same file and awaits the callback
3) client2 goes unresponsive (plug pulled, network partition, etc.)
4) client1 releases the lock

...at that point the server's lockd will try to queue up a GRANT_MSG callback for client2, but first it requeues the block with a timeout of 30s. nlm_async_call will attempt to bind the RPC client to client2 and will call rpc_ping. rpc_ping entails a sync RPC call, and if client2 is unresponsive it will take around 60s for that to time out. Once it times out, it's already time to retry the block and the whole process repeats.

Once in this situation, nlmsvc_retry_blocked will never return until the host starts responding again, and lockd won't service new calls.

Fix this by skipping the RPC ping on NLM RPC clients. This makes nlm_async_call return quickly when called.

Signed-off-by: Jeff Layton [EMAIL PROTECTED]
---
 fs/lockd/host.c | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/lockd/host.c b/fs/lockd/host.c
index ca6b16f..00063ee 100644
--- a/fs/lockd/host.c
+++ b/fs/lockd/host.c
@@ -244,6 +244,7 @@ nlm_bind_host(struct nlm_host *host)
 		.version	= host->h_version,
 		.authflavor	= RPC_AUTH_UNIX,
 		.flags		= (RPC_CLNT_CREATE_HARDRTRY |
+				   RPC_CLNT_CREATE_NOPING |
 				   RPC_CLNT_CREATE_AUTOBIND),
 	};
--
1.5.3.8
[PATCH 0/4] NLM: fix lockd hang when client blocking on released lock isn't responding
This patchset fixes the problem that Bruce pointed out last week when we were discussing the lockd-to-kthread patches. The main problem is described in patch #1, and that patch also fixes the DoS. The remaining patches clean up how GRANT_MSG callbacks handle an unresponsive client. The goal in those is to make sure that we don't end up with a ton of duplicate RPCs in the queue and that we try to handle an invalidated block correctly.

Bruce, I'd like to see this fixed in 2.6.25 if at all possible. Comments and suggestions are appreciated.

Signed-off-by: Jeff Layton [EMAIL PROTECTED]
[PATCH 3/4] NLM: don't reattempt GRANT_MSG when there is already an RPC in flight
With the current scheme in nlmsvc_grant_blocked, we can end up with more than one GRANT_MSG callback for a block in flight. Right now, we requeue the block unconditionally so that a GRANT_MSG callback is done again in 30s. If the client is unresponsive, it can take more than 30s for the call already in flight to time out.

There's no benefit to having more than one GRANT_MSG RPC queued up at a time, so put the block on the list with a timeout of NLM_NEVER before doing the RPC call. If the RPC call submission fails, we requeue it with a short timeout. If it works, then nlmsvc_grant_callback will end up requeueing it with a shorter timeout after it completes.

Signed-off-by: Jeff Layton [EMAIL PROTECTED]
---
 fs/lockd/svclock.c | 17 +++++++++++++----
 1 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
index 2f4d8fa..82db7b3 100644
--- a/fs/lockd/svclock.c
+++ b/fs/lockd/svclock.c
@@ -763,11 +763,20 @@ callback:
 	dprintk("lockd: GRANTing blocked lock.\n");
 	block->b_granted = 1;

-	/* Schedule next grant callback in 30 seconds */
-	nlmsvc_insert_block(block, 30 * HZ);
+	/* keep block on the list, but don't reattempt until the RPC
+	 * completes or the submission fails
+	 */
+	nlmsvc_insert_block(block, NLM_NEVER);

-	/* Call the client */
-	nlm_async_call(block->b_call, NLMPROC_GRANTED_MSG, &nlmsvc_grant_ops);
+	/* Call the client -- use a soft RPC task since nlmsvc_retry_blocked
+	 * will queue up a new one if this one times out
+	 */
+	error = nlm_async_call(block->b_call, NLMPROC_GRANTED_MSG,
+			       &nlmsvc_grant_ops);
+
+	/* RPC submission failed, wait a bit and retry */
+	if (error < 0)
+		nlmsvc_insert_block(block, 10 * HZ);
 }

 /*
--
1.5.3.8
[PATCH 4/4] NLM: don't requeue block if it was invalidated while GRANT_MSG was in flight
It's possible for lockd to catch a SIGKILL while a GRANT_MSG callback is in flight. If this happens, we don't want lockd to insert the block back into the nlm_blocked list.

This helps that situation, but there's still a possible race. Fixing that will mean adding real locking for nlm_blocked.

Signed-off-by: Jeff Layton [EMAIL PROTECTED]
---
 fs/lockd/svclock.c | 11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
index 82db7b3..fe9bdb4 100644
--- a/fs/lockd/svclock.c
+++ b/fs/lockd/svclock.c
@@ -795,6 +795,17 @@ static void nlmsvc_grant_callback(struct rpc_task *task, void *data)

 	dprintk("lockd: GRANT_MSG RPC callback\n");

+	/* if the block is not on a list at this point then it has
+	 * been invalidated. Don't try to requeue it.
+	 *
+	 * FIXME: it's possible that the block is removed from the list
+	 * after this check but before the nlmsvc_insert_block. In that
+	 * case it will be added back. Perhaps we need better locking
+	 * for nlm_blocked?
+	 */
+	if (list_empty(&block->b_list))
+		return;
+
 	/* Technically, we should down the file semaphore here. Since we
 	 * move the block towards the head of the queue only, no harm
 	 * can be done, though. */
--
1.5.3.8
Re: (fwd) nfs hang on 2.6.24
On Wed, Feb 06, 2008 at 10:15:23AM -0500, Trond Myklebust wrote:
> On Wed, 2008-02-06 at 10:07 -0500, J. Bruce Fields wrote:
> > That went into 2.6.22:
> >
> > 21315edd4877b593d5bf.. [PATCH] knfsd: nfsd4: demote clientid in use printk to a dprintk
> >
> > It may suggest a problem if this is happening a lot, though, right?
>
> The client should always be able to generate a new unique clientid if
> this happens.

And then the client may fail to reclaim its state on the next server reboot, or mistakenly prevent some other client from reclaiming state, since it's not recording the new clientid in stable storage. So if it's happening a lot, then I suppose we should figure out better ways to generate client ids.

--b.
Re: (fwd) nfs hang on 2.6.24
On Wed, 2008-02-06 at 12:23 -0500, J. Bruce Fields wrote:
> And then the client may fail to reclaim its state on the next server
> reboot, or mistakenly prevent some other client from reclaiming state,
> since it's not recording the new clientid in stable storage. So if
> it's happening a lot, then I suppose we should figure out better ways
> to generate client ids.

Huh? If the server reboots, the client will try to reclaim state using the _same_ client identifier string.

Two clients should _not_ be able to generate the same clientid unless they're also sharing the same IP address and a number of other properties that we include in the client identifier.
[PATCH 0/2] convert lockd to kthread API (try #10)
This is the tenth iteration of the patchset to convert lockd to use the kthread API. This patchset is smaller than the earlier ones since some of the patches in those sets have already been taken into Bruce's tree. This set only changes lockd to use the kthread API.

The only real difference between this patchset and the one posted yesterday is some added comments to clarify the possible race involved when signaling and calling kthread_stop.

Bruce, would you be willing to take this into your git tree once 2.6.25 development settles down? I'd like to have this considered for 2.6.26.

Thanks,

Signed-off-by: Jeff Layton [EMAIL PROTECTED]
[PATCH 1/2] SUNRPC: export svc_sock_update_bufs
Needed since the plan is to not have a svc_create_thread helper and to have current users of that function just call kthread_run directly.

Signed-off-by: Jeff Layton [EMAIL PROTECTED]
Reviewed-by: NeilBrown [EMAIL PROTECTED]
Signed-off-by: J. Bruce Fields [EMAIL PROTECTED]
---
 net/sunrpc/svcsock.c | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 1d3e5fc..b73a92a 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1101,6 +1101,7 @@ void svc_sock_update_bufs(struct svc_serv *serv)
 	}
 	spin_unlock_bh(&serv->sv_lock);
 }
+EXPORT_SYMBOL(svc_sock_update_bufs);

 /*
  * Initialize socket for RPC use and create svc_sock struct
--
1.5.3.8
[PATCH 2/2] NLM: Convert lockd to use kthreads
Have lockd_up start lockd using kthread_run. With this change, lockd_down now blocks until lockd actually exits, so there's no longer any need for the waitqueue code at the end of lockd_down. This also means that only one lockd can be running at a time, which simplifies the code within lockd's main loop.

This also adds a check for kthread_should_stop in the main loop of nlmsvc_retry_blocked and after that function returns. There's no sense continuing to retry blocks if lockd is coming down anyway.

The main difference between this patch and earlier ones is that it changes lockd_down to again send SIGKILL to lockd when it's coming down. svc_recv() uses schedule_timeout, so we can end up blocking there for a long time if we end up calling into it after kthread_stop wakes up lockd. Sending a SIGKILL should help ensure that svc_recv returns quickly if this occurs.

Because kthread_stop blocks until the kthread actually goes down, we have to send the signal before calling it. This means that there is a very small race window like this where lockd_down could block for a long time:

lockd_down signals lockd
lockd invalidates locks
lockd flushes signals
lockd checks kthread_should_stop
lockd_down calls kthread_stop
lockd calls svc_recv

...and lockd blocks until svc_recv returns. I think this is a pretty unlikely scenario though. This doesn't appear to be fixable without changing the kthread_stop machinery to send a signal.
Signed-off-by: Jeff Layton [EMAIL PROTECTED]
---
 fs/lockd/svc.c     | 144 +---
 fs/lockd/svclock.c |   3 +-
 2 files changed, 72 insertions(+), 75 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index 0822646..35e5ae2 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -25,6 +25,7 @@
 #include <linux/smp.h>
 #include <linux/smp_lock.h>
 #include <linux/mutex.h>
+#include <linux/kthread.h>
 #include <linux/freezer.h>

 #include <linux/sunrpc/types.h>
@@ -48,14 +49,11 @@ EXPORT_SYMBOL(nlmsvc_ops);
 static DEFINE_MUTEX(nlmsvc_mutex);
 static unsigned int		nlmsvc_users;
-static pid_t			nlmsvc_pid;
+static struct task_struct	*nlmsvc_task;
 static struct svc_serv		*nlmsvc_serv;
 int				nlmsvc_grace_period;
 unsigned long			nlmsvc_timeout;

-static DECLARE_COMPLETION(lockd_start_done);
-static DECLARE_WAIT_QUEUE_HEAD(lockd_exit);
-
 /*
  * These can be set at insmod time (useful for NFS as root filesystem),
  * and also changed through the sysctl interface. -- Jamie Lokier, Aug 2003
  */
@@ -111,35 +109,30 @@ static inline void clear_grace_period(void)
 /*
  * This is the lockd kernel thread
  */
-static void
-lockd(struct svc_rqst *rqstp)
+static int
+lockd(void *vrqstp)
 {
 	int err = 0;
+	struct svc_rqst *rqstp = vrqstp;
 	unsigned long grace_period_expire;

-	/* Lock module and set up kernel thread */
-	/* lockd_up is waiting for us to startup, so will
-	 * be holding a reference to this module, so it
-	 * is safe to just claim another reference
-	 */
-	__module_get(THIS_MODULE);
-	lock_kernel();
-
-	/*
-	 * Let our maker know we're running.
-	 */
-	nlmsvc_pid = current->pid;
-	nlmsvc_serv = rqstp->rq_server;
-	complete(&lockd_start_done);
-
-	daemonize("lockd");
+	/* try_to_freeze() is called from svc_recv() */
 	set_freezable();

-	/* Process request with signals blocked, but allow SIGKILL. */
+	/* Allow SIGKILL to tell lockd to drop all of its locks */
 	allow_signal(SIGKILL);

 	dprintk("NFS locking service started (ver " LOCKD_VERSION ").\n");

+	/*
+	 * FIXME: it would be nice if lockd didn't spend its entire life
+	 * running under the BKL. At the very least, it would be good to
+	 * have someone clarify what it's intended to protect here. I've
+	 * seen some handwavy posts about posix locking needing to be
+	 * done under the BKL, but it's far from clear.
+	 */
+	lock_kernel();
+
 	if (!nlm_timeout)
 		nlm_timeout = LOCKD_DFLT_TIMEO;
 	nlmsvc_timeout = nlm_timeout * HZ;
@@ -148,10 +141,9 @@ lockd(struct svc_rqst *rqstp)
 	/*
 	 * The main request loop. We don't terminate until the last
-	 * NFS mount or NFS daemon has gone away, and we've been sent a
-	 * signal, or else another process has taken over our job.
+	 * NFS mount or NFS daemon has gone away.
 	 */
-	while ((nlmsvc_users || !signalled()) && nlmsvc_pid == current->pid) {
+	while (!kthread_should_stop()) {
 		long timeout = MAX_SCHEDULE_TIMEOUT;
 		char buf[RPC_MAX_ADDRBUFLEN];
@@ -161,6 +153,7 @@ lockd(struct svc_rqst *rqstp)
 			nlmsvc_invalidate_all();
Re: (fwd) nfs hang on 2.6.24
On Wed, Feb 06, 2008 at 12:52:17PM -0500, Trond Myklebust wrote:
> Huh? If the server reboots, the client will try to reclaim state using
> the _same_ client identifier string.

Oh, right, I was confusing client and server reboot and assuming the client would forget the uniquifier on server reboot. That's obviously wrong! The client will forget its own uniquifier on client reboot, but that's alright since it's happy enough just to let that old state time out at that point. So the only possible problem is suboptimal behavior when the client reboot time is less than the lease time.

> Two clients should _not_ be able to generate the same clientid unless
> they're also sharing the same IP address and a number of other
> properties that we include in the client identifier.

Or unless two client implementations just happen to have clashing clientid generation algorithms, but we hope that's unlikely. (Except that older Linux clients were prone to produce the same clientid, if I remember right. But the more likely explanation may be that these are the result of a single client destroying and then creating state on the server within a lease period, and the server being stubborn and refusing to let go of the old state (even though no opens are associated with it any more) until the end of a lease period. I think that's a server bug.)

--b.
Re: [PATCH 2/2] NLM: Convert lockd to use kthreads
On Wed, 2008-02-06 at 13:21 -0500, Jeff Layton wrote:
> The main difference between this patch and earlier ones is that it
> changes lockd_down to again send SIGKILL to lockd when it's coming
> down. svc_recv() uses schedule_timeout, so we can end up blocking
> there for a long time if we end up calling into it after kthread_stop
> wakes up lockd. Sending a SIGKILL should help ensure that svc_recv
> returns quickly if this occurs.
>
> Because kthread_stop blocks until the kthread actually goes down, we
> have to send the signal before calling it. This means that there is a
> very small race window like this where lockd_down could block for a
> long time:

Having looked again at the code, could you please remind me _why_ we need to signal the process? AFAICS, kthread_stop() should normally wake the process up if it is in the schedule_timeout() state in svc_recv(), since it uses wake_up_process(). Shouldn't the only difference be that svc_recv() will return -EAGAIN instead of -EINTR? If so, why can't we just forgo the signal?

Trond
Re: NFS performance (Currently 2.6.20)
On Wed, 2008-02-06 at 19:24 +0100, Gabriel Barazer wrote:
> Oops (tm)! Fortunately I do mostly reads, but maybe the exports(5) man
> page should be updated. According to the man page, I thought that
> although writes aren't committed to the block devices, the server-side
> cache is correctly synchronized (but lost if you pull the plug).

...or if the server crashes for some reason.

> Thanks for the explanation. Having a battery-backed large write cache
> on the server, is there a performance hit when switching from async to
> sync in NFSv3?

The main performance hits occur on operations like create(), mkdir(), rename() and unlink(), since they are required to be immediately synced to disk. IOW: there will be a noticeable overhead when writing lots of small files. For large files, the overhead should be minimal, since all writes can be cached by the server until the close() operation.

Trond
Re: [PATCH 2/2] NLM: Convert lockd to use kthreads
On Wed, 06 Feb 2008 13:36:31 -0500 Trond Myklebust [EMAIL PROTECTED] wrote:
> Having looked again at the code, could you please remind me _why_ we
> need to signal the process? AFAICS, kthread_stop() should normally
> wake the process up if it is in the schedule_timeout() state in
> svc_recv(), since it uses wake_up_process(). Shouldn't the only
> difference be that svc_recv() will return -EAGAIN instead of -EINTR?
> If so, why can't we just forgo the signal?

There's no guarantee that kthread_stop() won't wake up lockd before schedule_timeout() gets called, but after the last check for kthread_should_stop().

-- Jeff Layton [EMAIL PROTECTED]
Re: [PATCH 2/2] NLM: Convert lockd to use kthreads
On Wed, 2008-02-06 at 13:47 -0500, Jeff Layton wrote:
> There's no guarantee that kthread_stop() won't wake up lockd before schedule_timeout() gets called, but after the last check for kthread_should_stop().

Doesn't the BKL pretty much eliminate this race? (assuming you transform that call to 'if (!kthread_should_stop()) schedule_timeout();')

Trond
Re: [PATCH 2/2] NLM: Convert lockd to use kthreads
On Wed, 06 Feb 2008 13:52:34 -0500 Trond Myklebust [EMAIL PROTECTED] wrote:
> On Wed, 2008-02-06 at 13:47 -0500, Jeff Layton wrote:
> > There's no guarantee that kthread_stop() won't wake up lockd before schedule_timeout() gets called, but after the last check for kthread_should_stop().
>
> Doesn't the BKL pretty much eliminate this race? (assuming you transform that call to 'if (!kthread_should_stop()) schedule_timeout();')
>
> Trond

I don't think so. That would require that lockd_down is always called with the BKL held, and I don't think it is, is it?

--
Jeff Layton [EMAIL PROTECTED]
Re: NFS EINVAL on open(... | O_TRUNC) on 2.6.23.9
Hi Gianluca-

On Feb 6, 2008, at 1:25 PM, Gianluca Alberici wrote:
> Hello all, Thanks to Chuck's help I finally decided to proceed to a git bisect and found the bad patch. Is there anybody that has an idea why it breaks userspace nfs servers as we have seen? Sorry for emailing Chuck Lever and Andrew Morton directly, but I really wanted to thank Chuck for his precious help, and I thought that with akpm having signed this commit, maybe he's going to figure out what's wrong easily.

The commit you found is a plausible source of the trouble (based on our current theory about the problem). What isn't quite clear to me is whether this commit causes your user-space server to start failing suddenly, or whether it causes the client to start sending the special non-standard time stamps in the SETATTR request. My guess is the latter, but I want to confirm this guess against reality :-)

Are you running the client and server concurrently on the same system? If so, it would be helpful if you could run this test with a constant kernel version on one side while varying it on the other. If client and server are already on different systems, can you tell us which system and which kernel combinations caused the failure? A matrix of combinations might be:

1. server kernel is before 1c710c89, client kernel is before 1c710c89
2. server kernel is before 1c710c89, client kernel is after 1c710c89
3. server kernel is after 1c710c89, client kernel is before 1c710c89
4. server kernel is after 1c710c89, client kernel is after 1c710c89

Thanks.
This is what I finally get from git:

1c710c896eb461895d3c399e15bb5f20b39c9073 is first bad commit
commit 1c710c896eb461895d3c399e15bb5f20b39c9073
Author: Ulrich Drepper [EMAIL PROTECTED]
Date: Tue May 8 00:33:25 2007 -0700

    utimensat implementation

    Implement utimensat(2) which is an extension to futimesat(2) in that it
    a) supports nano-second resolution for the timestamps
    b) allows to selectively ignore the atime/mtime value
    c) allows to selectively use the current time for either atime or mtime
    d) supports changing the atime/mtime of a symlink itself along the lines
       of the BSD lutimes(3) functions
    [...]
    [EMAIL PROTECTED]: add missing i386 syscall table entry]
    Signed-off-by: Ulrich Drepper [EMAIL PROTECTED]
    Cc: Alexey Dobriyan [EMAIL PROTECTED]
    Cc: Michael Kerrisk [EMAIL PROTECTED]
    Cc: [EMAIL PROTECTED]
    Signed-off-by: Andrew Morton [EMAIL PROTECTED]
    Signed-off-by: Linus Torvalds [EMAIL PROTECTED]

:040000 040000 3bedbc7fd919ba167b8e5f208a630261570853bb 927002a9423dcb51ba4f7bee53e60cdca6c1df43 M arch
:040000 040000 fd688c5b534efd3111cbf1e1095d6ff631738325 3d0fbf20fb3da1cb380c92f5b2b39815897376d3 M fs
:040000 040000 bfb1a907a9a842db4fa3543e12a8381d4e11b1eb 9c1d99324db12e066c0d17870fe48457809ad43b M include

Thanks in advance, regards, Gianluca

Hi Gianluca-

On Jan 30, 2008, at 7:40 AM, Gianluca Alberici wrote:
> Hello again everybody. Here follows the testbench:
> - I got two mirrors, same machine, same disk etc... changed hostname, IP, and on the second I have recompiled the kernel.
> - First: 2.6.21.7 on debian sarge
> - Second: 2.6.22, same system.
> - On both I have nfs-user-server and cfsd, latest versions
> - The export file is the same (localhost /opt/nfs (rw, async); stripping off the async option does not change anything)
> - Mount options are exactly the same.
> The problem arises in the very same manner with both nfs and cfsd:
> NFS:setattr { ... ... RPC:call_decode { return 22; } ... return 22; }

Again, there is nothing wrong with the RPC client or call_decode.
The *server* is returning NFSERR_INVAL (22) to a SETATTR request; the RPC client is simply passing that along to the NFS client, as it is designed to do.

> I have tried these kernels:
> 2.6.16.11 works
> 2.6.20 works
> 2.6.21 works
> 2.6.21.7 works
> 2.6.22 doesn't work (contiguous to previous version)
> 2.6.23 doesn't work (same behavior as previous)
> 2.6.23.9 doesn't work (as above)
> 2.6.24rc7 doesn't work (as above)
> I would really like to do more, client or server side, if you have any suggestions. Can we find out what is the change (doesn't matter if it is a bug or a bug fix) that caused this problem?

The goal here is to identify the kernel change between 2.6.21 and 2.6.22 that makes the client generate SETATTR requests the user-space server chokes on. It may be a change in the NFS client, or it could be somewhere else in the file system stack, like the VFS.

The usual procedure is to use git bisect. It does a binary search on the kernel patches between the working kernel version and the kernel version that is known not to work. It works like this:

1. You clone a linux kernel git repository (if you don't have a git repository already)
2. You tell git bisect which kernel version is working, and which isn't. git bisect then selects a commit about half way in between the working and non-working versions, and checks
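The bisect workflow described in those steps can be exercised end to end in a throwaway repository. The sketch below is only an illustration of the mechanics (synthetic commits, with a pretend "bug" arriving in commit 5 — none of these names come from the thread); for the real problem, the good/bad test would be rebuilding and booting each candidate kernel:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name you

# eight synthetic commits; pretend the "bug" arrived in commit 5
for i in 1 2 3 4 5 6 7 8; do
    echo "$i" > state
    git add state
    git commit -qm "commit $i"
done

first=$(git rev-list --max-parents=0 HEAD)   # oldest commit: known good
git bisect start HEAD "$first" >/dev/null    # HEAD: known bad

# a check that "fails" once state exceeds 4 stands in for the real test
while :; do
    if [ "$(cat state)" -le 4 ]; then
        out=$(git bisect good)
    else
        out=$(git bisect bad)
    fi
    case "$out" in *"first bad commit"*) break ;; esac
done

echo "$out" | head -n1    # names commit 5 as the first bad commit
git bisect reset >/dev/null
```

Each good/bad answer halves the remaining range, so eight commits take at most three test steps before git reports the first bad commit.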
Re: [PATCH 2/2] NLM: Convert lockd to use kthreads
On Wed, 6 Feb 2008 13:47:02 -0500 Jeff Layton [EMAIL PROTECTED] wrote:
> On Wed, 06 Feb 2008 13:36:31 -0500 Trond Myklebust [EMAIL PROTECTED] wrote:
> > On Wed, 2008-02-06 at 13:21 -0500, Jeff Layton wrote:
> > > Have lockd_up start lockd using kthread_run. With this change, lockd_down now blocks until lockd actually exits, so there's no longer any need for the waitqueue code at the end of lockd_down. This also means that only one lockd can be running at a time, which simplifies the code within lockd's main loop.
> > >
> > > This also adds a check for kthread_should_stop in the main loop of nlmsvc_retry_blocked and after that function returns. There's no sense continuing to retry blocks if lockd is coming down anyway.
> > >
> > > The main difference between this patch and earlier ones is that it changes lockd_down to again send SIGKILL to lockd when it's coming down. svc_recv() uses schedule_timeout, so we can end up blocking there for a long time if we end up calling into it after kthread_stop wakes up lockd. Sending a SIGKILL should help ensure that svc_recv returns quickly if this occurs.
> > >
> > > Because kthread_stop blocks until the kthread actually goes down, we have to send the signal before calling it. This means that there is a very small race window like this where lockd_down could block for a long time:
> >
> > Having looked again at the code, could you please remind me _why_ we need to signal the process? AFAICS, kthread_stop() should normally wake the process up if it is in the schedule_timeout() state in svc_recv(), since it uses wake_up_process(). Shouldn't the only difference be that svc_recv() will return -EAGAIN instead of -EINTR? If so, why can't we just forgo the signal?
>
> There's no guarantee that kthread_stop() won't wake up lockd before schedule_timeout() gets called, but after the last check for kthread_should_stop().

Sorry, I hit send too quick...
I'm certainly open to alternatives to signaling, but having a pending signal seems to be the best way to ensure that we don't end up blocking in schedule_timeout() here.

As a side note, I've rolled up a patch to add a kthread_stop_sig() variant that will use force_sig to wake up a kthread instead of just waking it up. I've not tested it yet, but once I do, and if we can get it in, then we should be able to close the race I'm talking about in this patch description as well...

--
Jeff Layton [EMAIL PROTECTED]
Re: NFS performance (Currently 2.6.20)
Gabriel Barazer wrote:
> On 02/06/2008 4:59:39 PM +0100, Jesper Krogh [EMAIL PROTECTED] wrote:
>
> I have a similar setup, and I'm very curious on how you can read an iowait value from the clients: On my nodes (server 2.6.21.5/clients 2.6.23.14), the iowait counter is only incremented when dealing with block devices, and since my nodes are diskless my iowait is near 0%.
>
> > Output in top is like this:
> >
> > top - 16:51:01 up 119 days, 6:10, 1 user, load average: 2.09, 2.00, 1.41
> > Tasks: 74 total, 2 running, 72 sleeping, 0 stopped, 0 zombie
> > Cpu(s): 0.2%us, 0.0%sy, 0.0%ni, 50.0%id, 49.8%wa, 0.0%hi, 0.0%si, 0.0%st
> > Mem: 2060188k total, 2047488k used, 12700k free, 2988k buffers
> > Swap: 4200988k total, 42776k used, 4158212k free, 1985500k cached
>
> You have obviously a block device on your nodes, so I suspect that something is reading/writing to it. Looking at how much memory is used, your system must be constantly swapping. This could explain why your iowait is so high (if your swap space is a block device or a file on a block device. You don't use swap over NFS, do you?)

No swap over NFS and no swapping at all. A vmstat 1 output of the above situation looks like:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff   cache   si   so   bi   bo    in   cs  us sy id wa
 0  2  42768  11580   1368 1987336    0    0    0    0   638  366   1  0 50 48
 0  2  42768  13088   1368 1985924    0    0    0    0   695  367   2  1 50 47
 0  2  42768  13028   1368 1986112    0    0    0    0   345  129   0  0 50 50
 1  1  42768  12720   1364 1986328    0    0    0    0  1043  710   6  1 50 42
 0  1  42768  12648   1364 1987308    0    0    0    0   636  374   2  4 50 44
 0  2  42768  11608   1364 1988436    0    0    0    0   696  382   1  0 51 49

You can also see that barely any swap is used in the top report.

Jesper

--
Jesper
Re: (fwd) nfs hang on 2.6.24
On Thu, Feb 07, 2008 at 10:19:06AM +1300, Andrew Dixie wrote:
> > Oh, right, I was confusing client and server reboot and assuming the client would forget the uniquifier on server reboot. That's obviously wrong! The client will forget its own uniquifier on client reboot, but that's alright since it's happy enough just to let that old state time out at that point. So the only possible problem is suboptimal behavior when the client reboot time is less than the lease time.
>
> There is one client, a stable connection between client and server, and neither client nor server are being rebooted. Are the "string in use by client" messages still expected?

Assuming the client creates and destroys clientid's on demand, as they're needed for opens, and uses whatever user credential it has at hand to do so, then I think a sequence of alternating opens and closes as different users could produce this. But no, it doesn't indicate any real problem on its own.

> Below is a program that attempts to open a file that is contained in a directory that has been deleted by another client. I'm not sure these are conditions that normally occur, it's just something I encountered trying to reproduce the hang. This reliably reproduces:
>
> Feb 7 09:55:01 devfile kernel: NFSD: preprocess_seqid_op: bad seqid (expected 20, got 22)

That's a bug though, either on the client or server side.

--b.

> And about 1 in 10 times it also reproduces:
>
> Feb 7 09:55:01 devfile kernel: NFSD: setclientid: string in use by client(clientid 47a627bd/044b)
>
> The server is 2.6.18-5 from debian.
---
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <stdlib.h>

#define ASSERT(x) \
	if (!(x)) { fprintf(stderr, "%s:%i:assert: " #x "\n", __FILE__, __LINE__); abort(); }

#define testdir "/home/andrewd/testdir"
#define testfile testdir "/fred"

int main(int argc, char *argv[])
{
	int fd;
	int rv;

	rv = mkdir(testdir, 0777);
	ASSERT(rv == 0 || errno == EEXIST);

	fd = open(testfile, O_CREAT|O_WRONLY);
	ASSERT(fd != -1);
	rv = write(fd, "stuff\n", 6);
	ASSERT(rv == 6);
	close(fd);

	rv = access(testfile, 0);
	ASSERT(rv == 0);

	// Remove directory via another client (nfsv3)
	system("ssh devlin7 rm -r testdir");

	// Try to open file
	fd = open(testfile, O_RDONLY);
	printf("got fd:%i errno:%i\n", fd, errno);
	// fd == -1, errno == ENOENT
	// This is expected; the error on the nfs server is not.
	return 0;
}
Re: NFS EINVAL on open(... | O_TRUNC) on 2.6.23.9
On Wed, 06 Feb 2008 22:55:02 +0100 Gianluca Alberici [EMAIL PROTECTED] wrote:
> I finally got it. Problem and solution were found 6 months ago, but nobody cared... up to now those servers have not been maintained, and this problem is not discussed anywhere else than at the following link. The bug (userspace server side, I would say at this point) is well described by the author of an nfs-user-server patch which has not been merged yet. The magic hint to find it on google was 'nfs server utimensat' :-)
>
> http://marc.info/?l=linux-nfs&m=118724649406144&w=2

This is pretty significant. We have on several occasions in recent years tightened up the argument checking on long-standing system calls and it's always a concern that this will break previously-working applications. And now it has happened. If we put buggy code into the kernel then we're largely stuck with it: we need to be back-compatible with our bugs so we don't break things like this.

> I have already prepared a working patch for cfsd based upon the one I've listed. The nfs patch has of course been waiting for commit since August 2007. I'll submit it to the debian cfsd maintainers, hoping to have more luck than my predecessor. It doesn't seem to me that there was any kernel-related issue. Thanks a lot again, and sorry for all the noise I have made. I will try to be more appropriate next time.

That wasn't noise - it was quite valuable. Thanks for all the work you did on this. Given that our broken-by-unbreaking code has been out there in several releases there isn't really any point in rebreaking it to fix this - the offending applications need to be repaired so they'll work on 2.6.22 and 2.6.23 anyway.
Re: (fwd) nfs hang on 2.6.24
> Oh, right, I was confusing client and server reboot and assuming the client would forget the uniquifier on server reboot. That's obviously wrong! The client will forget its own uniquifier on client reboot, but that's alright since it's happy enough just to let that old state time out at that point. So the only possible problem is suboptimal behavior when the client reboot time is less than the lease time.

There is one client, a stable connection between client and server, and neither client nor server are being rebooted. Are the "string in use by client" messages still expected?

Below is a program that attempts to open a file that is contained in a directory that has been deleted by another client. I'm not sure these are conditions that normally occur, it's just something I encountered trying to reproduce the hang.

This reliably reproduces:
Feb 7 09:55:01 devfile kernel: NFSD: preprocess_seqid_op: bad seqid (expected 20, got 22)

And about 1 in 10 times it also reproduces:
Feb 7 09:55:01 devfile kernel: NFSD: setclientid: string in use by client(clientid 47a627bd/044b)

The server is 2.6.18-5 from debian.

---
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <stdlib.h>

#define ASSERT(x) \
	if (!(x)) { fprintf(stderr, "%s:%i:assert: " #x "\n", __FILE__, __LINE__); abort(); }

#define testdir "/home/andrewd/testdir"
#define testfile testdir "/fred"

int main(int argc, char *argv[])
{
	int fd;
	int rv;

	rv = mkdir(testdir, 0777);
	ASSERT(rv == 0 || errno == EEXIST);

	fd = open(testfile, O_CREAT|O_WRONLY);
	ASSERT(fd != -1);
	rv = write(fd, "stuff\n", 6);
	ASSERT(rv == 6);
	close(fd);

	rv = access(testfile, 0);
	ASSERT(rv == 0);

	// Remove directory via another client (nfsv3)
	system("ssh devlin7 rm -r testdir");

	// Try to open file
	fd = open(testfile, O_RDONLY);
	printf("got fd:%i errno:%i\n", fd, errno);
	// fd == -1, errno == ENOENT
	// This is expected; the error on the nfs server is not.
	return 0;
}
Re: (fwd) nfs hang on 2.6.24
> What is rpciod doing while the machine hangs? Does 'netstat -t' show an active tcp connection to the server? Does tcpdump show any traffic going on the wire? What server are you running against? From the error messages below, I see it is a Linux machine, but which kernel is it running?

Server is 2.6.18-5 from debian.

From /proc/mounts:
server1:/files /files nfs rw,vers=3,rsize=8192,wsize=8192,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=10.64.2.90 0 0
devfile:/srv/linshared_srv /srv nfs rw,vers=3,rsize=32768,wsize=32768,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=10.64.2.21 0 0
devfile:/home /home nfs4 rw,vers=4,rsize=32768,wsize=32768,hard,intr,proto=tcp,timeo=600,retrans=3,sec=sys,addr=10.64.2.21 0 0

The nfs connections went into CLOSE_WAIT:
tcp        0      0 10.64.2.25:888       10.64.2.21:2049      CLOSE_WAIT
tcp        0      0 10.64.2.25:974       10.64.2.21:2049      CLOSE_WAIT

I can't see any traffic for it attempting to reconnect. Below are the rpciod stacktraces from the previous hang. Also rpc.idmapd looks to be in the middle of something.
Cheers,
Andrew

rpciod/0      S f76f9e7c     0  2663      2
  f7d7c1f0 0046 0002 f76f9e7c f76f9e74 0286 f669bc00 f7d7c358
  c180a940 015b37db f669bc00 dfbc8c80 00ff f76f9ebc f76f9ec4 c180284c
  f8c62e85 c02bc97f
Call Trace:
 [<f8c62e85>] rpc_wait_bit_interruptible+0x1a/0x1f [sunrpc]
 [<c02bc97f>] __wait_on_bit+0x33/0x58
 [<f8c62e6b>] rpc_wait_bit_interruptible+0x0/0x1f [sunrpc]
 [<f8c62e6b>] rpc_wait_bit_interruptible+0x0/0x1f [sunrpc]
 [<c02bca07>] out_of_line_wait_on_bit+0x63/0x6b
 [<c013545e>] wake_bit_function+0x0/0x3c
 [<f8c62e19>] __rpc_wait_for_completion_task+0x32/0x39 [sunrpc]
 [<f8ce1352>] nfs4_wait_for_completion_rpc_task+0x1b/0x2f [nfs]
 [<f8ce2336>] nfs4_proc_delegreturn+0x116/0x172 [nfs]
 [<f8c63411>] rpc_async_schedule+0x0/0xa [sunrpc]
 [<f8ced370>] nfs_do_return_delegation+0xf/0x1d [nfs]
 [<f8cd135f>] nfs_dentry_iput+0xd/0x49 [nfs]
 [<c01865d2>] dentry_iput+0x74/0x93
 [<c018666d>] d_kill+0x2d/0x46
 [<c0186970>] dput+0xd5/0xdc
 [<f8ce4016>] nfs4_free_closedata+0x26/0x41 [nfs]
 [<f8c62c8d>] rpc_release_calldata+0x16/0x20 [sunrpc]
 [<c013220d>] run_workqueue+0x7d/0x109
 [<c0132a83>] worker_thread+0x0/0xc5
 [<c0132b3d>] worker_thread+0xba/0xc5
 [<c0135429>] autoremove_wake_function+0x0/0x35
 [<c0135362>] kthread+0x38/0x5e
 [<c013532a>] kthread+0x0/0x5e
 [<c0104b0f>] kernel_thread_helper+0x7/0x10

rpciod/1-3 identical:
  df848710 0046 0002 f76fbfa0 f76fbf98 f8c633fd 0572 df848878
  c1812940 0001 015b36d3 df9abc08 f8c63411 00ff f776a840 c0132a83
  f76fbfd0 c0132b0b
Call Trace:
 [<f8c633fd>] __rpc_execute+0x21d/0x231 [sunrpc]
 [<f8c63411>] rpc_async_schedule+0x0/0xa [sunrpc]
 [<c0132a83>] worker_thread+0x0/0xc5
 [<c0132b0b>] worker_thread+0x88/0xc5
 [<c0135429>] autoremove_wake_function+0x0/0x35
 [<c0135362>] kthread+0x38/0x5e
 [<c013532a>] kthread+0x0/0x5e
 [<c0104b0f>] kernel_thread_helper+0x7/0x10
===
rpc.idmapd    S f777ff10     0  2687      1
  f7cea610 0086 0002 f777ff10 f777ff08 f7cea778 c1822940 0003
  015d5741 00ff 7fff f75e2b00 080536e8 0286 c02bc7f1
Call Trace:
 [<c01355e8>] add_wait_queue+0x12/0x32
 [<c017d287>] pipe_poll+0x24/0x7d
 [<c0183476>] do_select+0x365/0x3bc
 [<c0183a60>] __pollwait+0x0/0xac
 [<c011f44f>] default_wake_function+0x0/0x8 (message repeated 10 times)
 [<c0259bb5>] skb_release_all+0xa3/0xfa
 [<c025e590>] dev_hard_start_xmit+0x20c/0x277
 [<c026d227>] __qdisc_run+0x9e/0x164
 [<c02564e7>] sk_reset_timer+0xc/0x16
 [<c0260758>] dev_queue_xmit+0x288/0x2b0
 [<c026b72e>] eth_header+0x0/0xb6
 [<c0264fe5>] neigh_resolve_output+0x203/0x235
 [<c027dd59>] ip_finish_output+0x0/0x208
 [<c027df29>] ip_finish_output+0x1d0/0x208
 [<c027edd1>] ip_output+0x7d/0x92
 [<c01e240c>] number+0x147/0x215
 [<c0183750>] core_sys_select+0x283/0x2a0
 [<c01e2d23>] vsnprintf+0x440/0x47c
 [<c0187123>] d_lookup+0x1b/0x3b
 [<c01a5fe3>] proc_flush_task+0x12b/0x235
 [<c0135a53>] posix_cpu_timers_exit_group+0x4a/0x50
 [<c0108472>] convert_fxsr_from_user+0x15/0xd5
 [<c0183be2>] sys_select+0xd6/0x187
 [<c018a6ce>] mntput_no_expire+0x11/0x66
 [<c0176b05>] filp_close+0x51/0x58
 [<c012743f>] sys_wait4+0x31/0x34
 [<c0103e5e>] sysenter_past_esp+0x6b/0xa1
Re: (fwd) nfs hang on 2.6.24
On Thu, 2008-02-07 at 11:40 +1300, Andrew Dixie wrote: What is rpciod doing while the machine hangs? Does 'netstat -t' show an active tcp connection to the server? Does tcpdump show any traffic going on the wire? What server are you running against? From the error messages below, I see it is a Linux machine, but which kernel is it running? Server is 2.6.18-5 from debian. From /proc/mounts: server1:/files /files nfs rw,vers=3,rsize=8192,wsize=8192,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=10.64.2.90 0 0 devfile:/srv/linshared_srv /srv nfs rw,vers=3,rsize=32768,wsize=32768,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=10.64.2.21 0 0 devfile:/home /home nfs4 rw,vers=4,rsize=32768,wsize=32768,hard,intr,proto=tcp,timeo=600,retrans=3,sec=sys,addr=10.64.2.21 0 0 The nfs connections went into CLOSE_WAIT: tcp0 0 10.64.2.25:888 10.64.2.21:2049 CLOSE_WAIT tcp0 0 10.64.2.25:974 10.64.2.21:2049 CLOSE_WAIT I can't see any traffic for it attempting to reconnect. Below are the rpciod stacktraces from the previous hang. Also rpc.idmap looks to be in the middle of something. 
> Cheers, Andrew
>
> rpciod/0 S f76f9e7c 0 2663 2 f7d7c1f0 0046 0002 f76f9e7c f76f9e74 0286 f669bc00 f7d7c358 c180a940 015b37db f669bc00 dfbc8c80 00ff f76f9ebc f76f9ec4 c180284c f8c62e85 c02bc97f
> Call Trace:
> [f8c62e85] rpc_wait_bit_interruptible+0x1a/0x1f [sunrpc] [c02bc97f] __wait_on_bit+0x33/0x58 [f8c62e6b] rpc_wait_bit_interruptible+0x0/0x1f [sunrpc] [f8c62e6b] rpc_wait_bit_interruptible+0x0/0x1f [sunrpc] [c02bca07] out_of_line_wait_on_bit+0x63/0x6b [c013545e] wake_bit_function+0x0/0x3c [f8c62e19] __rpc_wait_for_completion_task+0x32/0x39 [sunrpc] [f8ce1352] nfs4_wait_for_completion_rpc_task+0x1b/0x2f [nfs] [f8ce2336] nfs4_proc_delegreturn+0x116/0x172 [nfs] [f8c63411] rpc_async_schedule+0x0/0xa [sunrpc] [f8ced370] nfs_do_return_delegation+0xf/0x1d [nfs] [f8cd135f] nfs_dentry_iput+0xd/0x49 [nfs] [c01865d2] dentry_iput+0x74/0x93 [c018666d] d_kill+0x2d/0x46 [c0186970] dput+0xd5/0xdc [f8ce4016] nfs4_free_closedata+0x26/0x41 [nfs] [f8c62c8d] rpc_release_calldata+0x16/0x20 [sunrpc] [c013220d] run_workqueue+0x7d/0x109 [c0132a83] worker_thread+0x0/0xc5 [c0132b3d] worker_thread+0xba/0xc5 [c0135429] autoremove_wake_function+0x0/0x35 [c0135362] kthread+0x38/0x5e [c013532a] kthread+0x0/0x5e [c0104b0f] kernel_thread_helper+0x7/0x10

That's the bug right there: rpciod should never be making a synchronous RPC call. I've already got a fix for this bug against 2.6.24. Could you see if it applies to your kernel too?

Cheers
  Trond

---BeginMessage---
Otherwise, there is a potential deadlock if the last dput() from an NFSv4 close() or other asynchronous operation leads to nfs_clear_inode calling the synchronous delegreturn.
Signed-off-by: Trond Myklebust [EMAIL PROTECTED]
---
 fs/nfs/delegation.c |   29 +
 fs/nfs/delegation.h |    3 ++-
 fs/nfs/dir.c        |    1 -
 fs/nfs/inode.c      |    2 +-
 fs/nfs/nfs4proc.c   |   22 +-
 5 files changed, 41 insertions(+), 16 deletions(-)

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index b03dcd8..2dead8d 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -174,11 +174,11 @@ int nfs_inode_set_delegation(struct inode *inode, struct rpc_cred *cred, struct
 	return status;
 }
 
-static int nfs_do_return_delegation(struct inode *inode, struct nfs_delegation *delegation)
+static int nfs_do_return_delegation(struct inode *inode, struct nfs_delegation *delegation, int issync)
 {
 	int res = 0;
 
-	res = nfs4_proc_delegreturn(inode, delegation->cred, delegation->stateid);
+	res = nfs4_proc_delegreturn(inode, delegation->cred, delegation->stateid, issync);
 	nfs_free_delegation(delegation);
 	return res;
 }
@@ -208,7 +208,7 @@ static int __nfs_inode_return_delegation(struct inode *inode, struct nfs_delegat
 	up_read(&clp->cl_sem);
 	nfs_msync_inode(inode);
 
-	return nfs_do_return_delegation(inode, delegation);
+	return nfs_do_return_delegation(inode, delegation, 1);
 }
 
 static struct nfs_delegation *nfs_detach_delegation_locked(struct nfs_inode *nfsi, const nfs4_stateid *stateid)
@@ -228,6 +228,27 @@
 nomatch:
 	return NULL;
 }
 
+/*
+ * This function returns the delegation without reclaiming opens
+ * or protecting against delegation reclaims.
+ * It is therefore really only safe to be called from
+ * nfs4_clear_inode()
+ */
+void nfs_inode_return_delegation_noreclaim(struct inode *inode)
+{
+	struct nfs_client *clp = NFS_SERVER(inode)->nfs_client;
+	struct nfs_inode *nfsi = NFS_I(inode);
+	struct nfs_delegation *delegation;
+
+	if (rcu_dereference(nfsi->delegation) != NULL) {
+
Re: [PATCH 2/2] NLM: Convert lockd to use kthreads
On Wed, 2008-02-06 at 14:09 -0500, Jeff Layton wrote:
> On Wed, 06 Feb 2008 13:52:34 -0500 Trond Myklebust [EMAIL PROTECTED] wrote:
> > On Wed, 2008-02-06 at 13:47 -0500, Jeff Layton wrote:
> > > There's no guarantee that kthread_stop() won't wake up lockd before schedule_timeout() gets called, but after the last check for kthread_should_stop().
> >
> > Doesn't the BKL pretty much eliminate this race? (assuming you transform that call to 'if (!kthread_should_stop()) schedule_timeout();')
> >
> > Trond
>
> I don't think so. That would require that lockd_down is always called with the BKL held, and I don't think it is, is it?

Nothing stops you from grabbing the BKL inside lockd_down, though :-)
[NFS] [patch 59/73] knfsd: Allow NFSv2/3 WRITE calls to succeed when krb5i etc is used.
2.6.23-stable review patch. If anyone has any objections, please let us know.

--
From: NeilBrown [EMAIL PROTECTED]

patch ba67a39efde8312e386c6f603054f8945433d91f in mainline.

When RPCSEC/GSS and krb5i is used, requests are padded, typically to a multiple of 8 bytes. This can make the request look slightly longer than it really is. As of f34b95689d2ce001c ("The NFSv2/NFSv3 server does not handle zero length WRITE requests correctly"), the xdr decode routines for NFSv2 and NFSv3 reject requests that aren't the right length, so krb5i (for example) WRITE requests can get lost.

This patch relaxes the appropriate test and enhances the related comment.

Signed-off-by: Neil Brown [EMAIL PROTECTED]
Signed-off-by: J. Bruce Fields [EMAIL PROTECTED]
Cc: Peter Staubach [EMAIL PROTECTED]
Signed-off-by: Linus Torvalds [EMAIL PROTECTED]
Signed-off-by: Greg Kroah-Hartman [EMAIL PROTECTED]

---
 fs/nfsd/nfs3xdr.c |    5 ++++-
 fs/nfsd/nfsxdr.c  |    5 ++++-
 2 files changed, 8 insertions(+), 2 deletions(-)

--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -396,8 +396,11 @@ nfs3svc_decode_writeargs(struct svc_rqst
 	 * Round the length of the data which was specified up to
 	 * the next multiple of XDR units and then compare that
 	 * against the length which was actually received.
+	 * Note that when RPCSEC/GSS (for example) is used, the
+	 * data buffer can be padded so dlen might be larger
+	 * than required.  It must never be smaller.
 	 */
-	if (dlen != XDR_QUADLEN(len)*4)
+	if (dlen < XDR_QUADLEN(len)*4)
 		return 0;
 
 	if (args->count > max_blocksize) {
--- a/fs/nfsd/nfsxdr.c
+++ b/fs/nfsd/nfsxdr.c
@@ -313,8 +313,11 @@ nfssvc_decode_writeargs(struct svc_rqst
 	 * Round the length of the data which was specified up to
 	 * the next multiple of XDR units and then compare that
 	 * against the length which was actually received.
+	 * Note that when RPCSEC/GSS (for example) is used, the
+	 * data buffer can be padded so dlen might be larger
+	 * than required.  It must never be smaller.
 	 */
-	if (dlen != XDR_QUADLEN(len)*4)
+	if (dlen < XDR_QUADLEN(len)*4)
 		return 0;
 
 	rqstp->rq_vec[0].iov_base = (void*)p;
Re: NFS+krb5: Failed to create krb5 context for user with uid 0
On Feb 5, 2008, at 9:12 PM, Kevin Coffman wrote:
> If the Mac server code can support other encryption types like Triple DES and ArcFour, you shouldn't need to limit it to only the des-cbc-crc key. The Linux nfs-utils code on the client should be limiting the negotiated encryption type to des. I would assume if normal users are able to get a context and talk to the server, that root using the keytab should be able to do so as well.

I added a principal for root/[EMAIL PROTECTED] and added it to the client's keytab and everything appears to work now. I then put the other keys back on the server's keytab as you suggested. Thanks for the help!

Luke
RE: kernel exports table flushes out on running exportfs -a over mips
Hi:

I did some extensive digging into the codebase and I believe I have the reason why exportfs -a flushes out the caches after NFS clients have mounted the NFS filesystem. The analysis is complicated, but here's the crux of the matter:

There is a difference between /etc/exports and the kernel-maintained cache. The difference is that in /etc/exports we use anonymous clients (*), whereas the kernel maintains FQDN client names in its exports cache (see attached file). This difference (the parsing code client_gettype() specifically checks for a *, an IP, or a hostname, among other things, and based on that creates two different types of caches) causes the nfs codebase to recreate new in-core exports entries (the second time, when we issue exportfs -a) after parsing /proc/fs/nfs/export. Immediately afterwards, it then throws these away (for these newly created entries, m_mayexport = 0 and m_exported = 1 in function xtab_read()). For details, see the logic in exports_update_one():

if (exp->m_exported && !exp->m_mayexport) { ... unexporting ... }

Since both the anonymous and FQDN entries are essentially the same, this results in blowing away the existing kernel exports table. My question is: is there an elegant solution to this problem without simply using FQDNs in /etc/exports? I have confirmed that the problem does not occur when both the in-kernel and /etc/exports tables have the same entries (both * or both FQDN).

Cheers,
Ani

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:linux-nfs-[EMAIL PROTECTED]] On Behalf Of Anirban Sinha
Sent: Thursday, January 31, 2008 2:09 PM
To: Greg Banks
Cc: linux-nfs@vger.kernel.org
Subject: RE: kernel exports table flushes out on running exportfs -a over mips

Hi Greg:

Thanks for replying.
Here goes my response:

-----Original Message-----
From: Greg Banks [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, January 30, 2008 6:37 PM
To: Anirban Sinha
Cc: linux-nfs@vger.kernel.org
Subject: Re: kernel exports table flushes out on running exportfs -a over mips

On Wed, Jan 30, 2008 at 05:34:13PM -0800, Anirban Sinha wrote:

Hi: I am seeing an unusual problem running an nfs server on mips; on Intel this does not happen. When I run exportfs -a on the server after the clients have already mounted their nfs filesystems, the kernel exports table, as seen from /proc/fs/nfs/exports, gets completely flushed out. We (me and one other colleague) have done some digging (mostly looking into the nfs-utils codebase) and it looks like a kernel-side issue. We had also asked folks on the linux-mips mailing list, but apparently no one has any clue. I am just hoping that those who are more familiar with the user-level and kernel sides of nfs might give me something more to chew on. Any suggestions would be really useful. If you think the information I provided is not enough, I can give you any other information you need in this regard.

Does the MIPS box have the /proc/fs/nfsd/ filesystem mounted?

Ahh, I see what you mean. Yes, it is mounted, both /proc/fs/nfsd and /proc/fs/nfs. However, I can see from the code that check_new_cache() checks for a file named "filehandle" which does not exist in that location. To be dead sure, I instrumented the code to insert a perror, and it returns "no such file or directory". The new_cache flag remains 0. Is this some sort of kernel bug?

Perhaps you could try

1) running exportfs under strace. I suggest: strace -o /tmp/s.log -s 1024 exportfs ...

Strace does not work in our environment as it has not been properly ported to mips.

2) AND enabling kernel debug messages:

	rpcdebug -m nfsd -s export
	rpcdebug -m rpc -s cache

I attach the dmesg output after enabling those flags. Zeugma-x-y are the clients to this server. Not sure if it means anything suspicious.
Ani

--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
The cake is *not* a lie.
I don't speak for SGI.

root:[EMAIL PROTECTED]:~# cat /proc/fs/nfs/exports
# Version 1.1
# Path	Client(Flags)	# IPs
/cf4	zeugma-1-3(rw,insecure,no_root_squash,async,wdelay,no_subtree_check,insecure_locks)
/cf2	zeugma-1-4(rw,insecure,no_root_squash,async,wdelay,no_subtree_check,insecure_locks)
/cf2/cpu3	zeugma-1-3(rw,insecure,no_root_squash,async,wdelay,no_subtree_check,insecure_locks)
/cf2/cpu4	zeugma-1-4(rw,insecure,no_root_squash,async,wdelay,no_subtree_check,insecure_locks)
/cf2/cpu2	zeugma-1-2(rw,insecure,no_root_squash,async,wdelay,no_subtree_check,insecure_locks)
/cf2	zeugma-1-2(rw,insecure,no_root_squash,async,wdelay,no_subtree_check,insecure_locks)
/cf4	zeugma-1-2(rw,insecure,no_root_squash,async,wdelay,no_subtree_check,insecure_locks)
/cf4	zeugma-1-4(rw,insecure,no_root_squash,async,wdelay,no_subtree_check,insecure_locks)
/cf2	zeugma-1-3(rw,insecure,no_root_squash,async,wdelay,no_subtree_check,insecure_locks)
Wondering about NLM_HOST_MAX ... doesn't anyone understand this code?
Hi, I've been looking at NLM_HOST_MAX in fs/lockd/host.c, as we have a patch in SLES that makes it configurable, and the patch needs to either go upstream or out the window... But the code that uses NLM_HOST_MAX is weird! Look:

	#define NLM_HOST_EXPIRE		((nrhosts > NLM_HOST_MAX) ? 300 * HZ : 120 * HZ)
	#define NLM_HOST_COLLECT	((nrhosts > NLM_HOST_MAX) ? 120 * HZ : 60 * HZ)

So if the number of hosts is more than the maximum (64), we *increase* the expiry time and the garbage collection interval. You would think they should be decreased when we have exceeded the max, so we can get rid of more old entries more quickly. But no, they are increased. And in the code where we add a host to the list of hosts:

	if (++nrhosts > NLM_HOST_MAX)
		next_gc = 0;

So when we go over the limit, we garbage collect straight away, but almost certainly do nothing, because we've just given every host an extra 3 minutes that it is allowed to live. We could change the '>' to a '<', which would at least make the code make sense, but I don't think we want to. A server could easily have more than 64 clients doing lock requests, and we don't want to make life harder for clients just because there are more of them. I think we should just get rid of NLM_HOST_MAX altogether. Old hosts will still go away in a few minutes, and pushing them out more quickly shouldn't be needed.

So: any comments on the above, or on the patch below? I've chosen to go with "discard hosts older than 5 minutes every 2 minutes" rather than "discard hosts older than 2 minutes every minute", even though the latter is what would have been in effect most of the time, as it seems more like what was intended.
Thanks,
NeilBrown

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./fs/lockd/host.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff .prev/fs/lockd/host.c ./fs/lockd/host.c
--- .prev/fs/lockd/host.c	2008-02-07 14:20:54.000000000 +1100
+++ ./fs/lockd/host.c	2008-02-07 14:23:38.000000000 +1100
@@ -19,12 +19,11 @@
 #define NLMDBG_FACILITY		NLMDBG_HOSTCACHE
-#define NLM_HOST_MAX		64
 #define NLM_HOST_NRHASH		32
 #define NLM_ADDRHASH(addr)	(ntohl(addr) & (NLM_HOST_NRHASH-1))
 #define NLM_HOST_REBIND		(60 * HZ)
-#define NLM_HOST_EXPIRE		((nrhosts > NLM_HOST_MAX)? 300 * HZ : 120 * HZ)
-#define NLM_HOST_COLLECT	((nrhosts > NLM_HOST_MAX)? 120 * HZ : 60 * HZ)
+#define NLM_HOST_EXPIRE		(300 * HZ)
+#define NLM_HOST_COLLECT	(120 * HZ)
 static struct hlist_head	nlm_hosts[NLM_HOST_NRHASH];
 static unsigned long		next_gc;
@@ -142,9 +141,6 @@ nlm_lookup_host(int server, const struct
 	INIT_LIST_HEAD(&host->h_granted);
 	INIT_LIST_HEAD(&host->h_reclaim);
-	if (++nrhosts > NLM_HOST_MAX)
-		next_gc = 0;
-
 out:
 	mutex_unlock(&nlm_host_mutex);
 	return host;
RE: kernel exports table flushes out on running exportfs -a over mips
At a higher level, in general, I think the kernel exports table need not match /etc/exports at all. When we run exportfs -a again, what the codebase intends to do is the following:

1. Scan /etc/exports and verify that an entry exists (create one if not) in its in-core exports table. Mark each of these as may_be_exported.
2. Scan /proc and see that each of the entries there has a corresponding entry in the in-core exports table (a matching operation). If not, create a new entry. Mark all entries from /proc as exported.
3. If there are any entries that are *not* may_be_exported and yet exported, issue the right rpc through the /proc/net/sunrpc/app cache to delete that entry from the kernel table.

In this case, the matching operation does not detect that a * in the hostname essentially means *anyone* can mount the volume, regardless of their specific names. As a result, duplicate entries are created, and ultimately everything gets flushed out :( Any elegant suggestion/bugfix would be really appreciated.

Ani

-----Original Message-----
From: Anirban Sinha
Sent: Wednesday, February 06, 2008 6:20 PM
To: Anirban Sinha; Greg Banks
Cc: linux-nfs@vger.kernel.org
Subject: RE: kernel exports table flushes out on running exportfs -a over mips

Hi: I did some extensive digging into the codebase and I believe I have the reason why exportfs -a flushes out the caches after NFS clients have mounted the NFS filesystem. The analysis is complicated, but here's the crux of the matter: there is a difference between /etc/exports and the kernel-maintained cache. In /etc/exports we use anonymous clients (*), whereas the kernel maintains FQDN client names in its exports cache (see attached file).
This difference (the parsing code, client_gettype() specifically, checks for a *, an IP, or a hostname, among other things, and based on that creates two different types of cache entries) causes the nfs codebase to recreate new in-core exports entries (the second time we issue exportfs -a) after parsing /proc/fs/nfs/export. Immediately afterwards it throws these away (for these newly created entries, m_mayexport = 0 and m_exported = 1 in function xtab_read()). For details, see the logic in exports_update_one():

	if (exp->m_exported && !exp->m_mayexport) { ... unexporting ... }

Since both the anonymous and FQDN entries are essentially the same, this results in blowing away the existing kernel exports table. My question is: is there an elegant solution to this problem, short of simply using FQDNs in /etc/exports? I have confirmed that the problem does not occur when the in-kernel and /etc/exports tables have the same entries (both * or both FQDN).

Cheers, Ani

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:linux-nfs-[EMAIL PROTECTED]] On Behalf Of Anirban Sinha
Sent: Thursday, January 31, 2008 2:09 PM
To: Greg Banks
Cc: linux-nfs@vger.kernel.org
Subject: RE: kernel exports table flushes out on running exportfs -a over mips

Hi Greg: Thanks for replying. Here goes my response:

-----Original Message-----
From: Greg Banks [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, January 30, 2008 6:37 PM
To: Anirban Sinha
Cc: linux-nfs@vger.kernel.org
Subject: Re: kernel exports table flushes out on running exportfs -a over mips

On Wed, Jan 30, 2008 at 05:34:13PM -0800, Anirban Sinha wrote:

Hi: I am seeing an unusual problem running an nfs server on mips; on Intel this does not happen. When I run exportfs -a on the server after the clients have already mounted their nfs filesystems, the kernel exports table, as seen from /proc/fs/nfs/exports, gets completely flushed out.
We (me and one other colleague) have done some digging (mostly looking into the nfs-utils codebase) and it looks like a kernel-side issue. We had also asked folks on the linux-mips mailing list, but apparently no one has any clue. I am just hoping that those who are more familiar with the user-level and kernel sides of nfs might give me something more to chew on. Any suggestions would be really useful. If you think the information I provided is not enough, I can give you any other information you need in this regard.

Does the MIPS box have the /proc/fs/nfsd/ filesystem mounted?

Ahh, I see what you mean. Yes, it is mounted, both /proc/fs/nfsd and /proc/fs/nfs. However, I can see from the code that check_new_cache() checks for a file named "filehandle" which does not exist in that location. To be dead sure, I instrumented the code to insert a perror, and it returns "no such file or directory". The new_cache flag remains 0. Is this some sort of kernel bug?

Perhaps you could try

1) running exportfs under strace. I suggest: strace -o /tmp/s.log -s 1024 exportfs ...

Strace does not work in our environment as it has not been
Re: [NFS] [PATCH] Make UDF exportable
On Wednesday February 6, [EMAIL PROTECTED] wrote:

	+	dotdot.d_name.name = "..";
	+	dotdot.d_name.len = 2;
	+
	+	lock_kernel();
	+	if (!udf_find_entry(child->d_inode, &dotdot, &fibh, &cfi))
	+		goto out_unlock;

Have you ever tried this? I think this could never work. UDF doesn't have an entry named ".." in a directory. You have to search for an entry that has the FID_FILE_CHAR_PARENT bit set in fileCharacteristics. Maybe you could hack around udf_find_entry() to recognize the ".." dentry and do the search accordingly.

Probably not. I just tested that I could read files and navigate the directory structure. However, looking into UDF, I think you are right - it will fail. I have extended udf_find_entry() to do an explicit check based on fileCharacteristics, as you propose. How do I actually test this case?

- Mount the filesystem from the server.
- 'cd' a few directories down into the filesystem.
- Reboot the server(1).
- On the client, 'ls -l'.

(1) A full reboot isn't needed. Just unexport, unmount, remount, re-export on the server. Alternately, use a non-Linux client, cd down into the filesystem, and 'ls -l ..'.

NeilBrown
RE: kernel exports table flushes out on running exportfs -a over mips
Sorry, it does look like it indeed solved the problem. Clearly, I have missed something in my analysis of the codebase. In any case, thanks a lot.

Good night, Ani

-----Original Message-----
From: Neil Brown [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 06, 2008 9:22 PM
To: Anirban Sinha
Cc: Greg Banks; linux-nfs@vger.kernel.org
Subject: RE: kernel exports table flushes out on running exportfs -a over mips

On Thursday January 31, [EMAIL PROTECTED] wrote:

Does the MIPS box have the /proc/fs/nfsd/ filesystem mounted?

Ahh, I see what you mean. Yes, it is mounted, both /proc/fs/nfsd and /proc/fs/nfs. However, I can see from the code that check_new_cache() checks for a file named "filehandle" which does not exist in that location. To be dead sure, I instrumented the code to insert a perror, and it returns "no such file or directory". The new_cache flag remains 0. Is this some sort of kernel bug?

OK, that means that /proc/fs/nfs is *not* mounted. /proc is mounted, and it contains several directories including /proc/fs/nfs and /proc/fs/nfsd. To get modern NFS service, you need to

	mount -t nfsd nfsd /proc/fs/nfsd

before running any nfsd-related programs (e.g. mountd, nfsd). Most distros do that in their startup scripts. It seems you are missing this.

However, it should still work. It seems that it doesn't. I tried without /proc/fs/nfsd mounted and got the same result as you. It seems that we broke things when /var/lib/nfs/rmtab was changed to store IP addresses rather than host names. The following patch to nfs-utils will fix it. Or you can just mount the 'nfsd' filesystem as above.

NeilBrown

diff --git a/support/export/client.c b/support/export/client.c
index 1cb242f..e96f5e0 100644
--- a/support/export/client.c
+++ b/support/export/client.c
@@ -462,5 +462,5 @@ client_gettype(char *ident)
 	sp++; if(!isdigit(*sp) || strtoul(sp, &sp, 10) > 255 || *sp != '.') return MCL_FQDN;
 	sp++; if(!isdigit(*sp) || strtoul(sp, &sp, 10) > 255 || *sp != '\0') return MCL_FQDN;
 	/* we lie here a bit. but technically N.N.N.N == N.N.N.N/32 :) */
-	return MCL_SUBNETWORK;
+	return MCL_FQDN;
 }