Thanks Andrew.
What I'm really hoping for is anything I can do to make this gfs_grow
go faster. It's been running for 19 hours now, I have no idea when
it'll complete, and the file system I'm trying to grow has been all
but unusable for the duration. This is a very busy file system, and I
know it's best to run gfs_grow on a quiet file system, but there isn't
too much I can do about that. Alternatively, if anyone knows of a
signal I could send to gfs_grow that would cause it to give a status
report or increase verbosity, that would be helpful, too. I have
tried both increasing and decreasing the number of NFS threads, but
since I can't tell where I am in the process or how quickly it's
going, I have no idea what effect this has on operations.
Thanks,
James
On Oct 8, 2008, at 5:12 PM, Andrew A. Neuschwander wrote:
James,
I have a CentOS 5.2 cluster where I would see the same nfs errors
under certain conditions. If I did anything that introduced latency
to my gfs operations on the node that served nfs, the nfs threads
couldn't service requests faster than they came in from clients.
Eventually my nfs threads would all be busy and start dropping nfs
requests. I kept an eye on my nfsd thread utilization
(/proc/net/rpc/nfsd) and kept bumping up the number of threads until
they could handle all the requests while the gfs had a higher latency.
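A minimal sketch of reading the thread statistics mentioned above. It assumes the 2.6-era layout of the "th" line in /proc/net/rpc/nfsd: the thread count, the number of times all threads were busy at once, then ten histogram buckets of seconds spent at each 10% band of thread utilization. Newer kernels are believed to report zeros in these fields. The function name and the sample values are illustrative only and were not taken from either cluster in this thread.

```python
import os

def parse_nfsd_threads(stats_text):
    """Return the fields of the 'th' line, or None if the line is absent.

    Assumed layout (2.6-era kernels):
        th <threads> <times-all-busy> <ten histogram buckets, in seconds>
    """
    for line in stats_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "th":
            return {
                "threads": int(fields[1]),
                "all_busy_count": int(fields[2]),
                "busy_seconds": [float(x) for x in fields[3:]],
            }
    return None

# Made-up sample in the assumed format, so the sketch also runs off-server.
SAMPLE = (
    "rc 0 0 0\n"
    "th 8 594 3733.140 83.850 96.660 0.000 0.000 "
    "0.000 0.000 0.000 0.000 0.000\n"
)

if __name__ == "__main__":
    path = "/proc/net/rpc/nfsd"
    if os.path.exists(path):
        with open(path) as handle:
            text = handle.read()
    else:
        text = SAMPLE
    print(parse_nfsd_threads(text))
```

If all_busy_count keeps climbing between samples, or the last histogram buckets keep growing, the threads are likely saturated, which is the condition that prompted raising the thread count.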
In my case, I had EMC Networker streaming data from my gfs
filesystems to a local scsi tape device on the same node that served
nfs. I eventually separated them onto different nodes.
I'm sure gfs_grow would slow down your gfs enough that your nfs
threads couldn't keep up. NFS on gfs seems to be very latency
sensitive. I have a quick and dirty perl script to generate a
histogram image from nfs thread stats if you are interested.
-Andrew
--
Andrew A. Neuschwander, RHCE
Linux Systems/Software Engineer
College of Forestry and Conservation
The University of Montana
http://www.ntsg.umt.edu
[EMAIL PROTECTED] - 406.243.6310
James Chamberlain wrote:
Hi all,
I'd like to thank Bob Peterson for helping me solve the last
problem I was seeing with my storage cluster. I've got a new one
now. A couple days ago, site ops plugged in a new storage shelf
and this triggered some sort of error in the storage chassis. I
was able to sort that out with gfs_fsck, and have since gotten the
new storage recognized by the cluster. I'd like to make use of
this new storage, and it's here that we run into trouble.
lvextend completed with no trouble, so I ran gfs_grow. gfs_grow
has been running for over an hour now and has not progressed past:
[EMAIL PROTECTED] ~]# gfs_grow /dev/s12/scratch13
FS: Mount Point: /scratch13
FS: Device: /dev/s12/scratch13
FS: Options: rw,noatime,nodiratime
FS: Size: 4392290302
DEV: Size: 5466032128
Preparing to write new FS information...
The load average on this node has risen from its normal ~30-40 to
513 (the number of nfsd threads, plus one), and the file system has
become slow-to-inaccessible on client nodes. I am seeing messages
in my log files that indicate things like:
Oct 8 16:26:00 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket
Oct 8 16:26:00 s12n01 last message repeated 4 times
Oct 8 16:26:00 s12n01 kernel: nfsd: peername failed (err 107)!
Oct 8 16:26:58 s12n01 kernel: nfsd: peername failed (err 107)!
Oct 8 16:27:56 s12n01 last message repeated 2 times
Oct 8 16:27:56 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket
Oct 8 16:27:56 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket
Oct 8 16:27:56 s12n01 kernel: nfsd: peername failed (err 107)!
Oct 8 16:27:56 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket
Oct 8 16:27:56 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket
Oct 8 16:27:56 s12n01 kernel: nfsd: peername failed (err 107)!
Oct 8 16:28:34 s12n01 last message repeated 2 times
Oct 8 16:30:29 s12n01 last message repeated 2 times
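As a rough aid for reading log excerpts like the one above, here is a minimal sketch that tallies the nfsd errors by errno name. It assumes the standard Linux errno values (104 is ECONNRESET, 107 is ENOTCONN). The function name is hypothetical, and syslog's "last message repeated N times" lines are deliberately not counted, so the totals are a lower bound.

```python
import re
from collections import Counter

# Standard Linux errno values for the two codes that appear in the log.
ERRNO_NAMES = {104: "ECONNRESET", 107: "ENOTCONN"}

# Matches both message shapes seen in the excerpt:
#   "nfsd: got error -104 ..." and "nfsd: peername failed (err 107)!"
PATTERN = re.compile(
    r"nfsd: (?:got error -(\d+)|peername failed \(err (\d+)\))"
)

def tally_nfsd_errors(log_text):
    """Count nfsd error messages by errno name (unknown codes keep the number)."""
    counts = Counter()
    for match in PATTERN.finditer(log_text):
        code = int(match.group(1) or match.group(2))
        counts[ERRNO_NAMES.get(code, str(code))] += 1
    return counts

# The first four lines of the excerpt, used as sample input.
SAMPLE = (
    "Oct 8 16:26:00 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104 "
    "when sending 140 bytes - shutting down socket\n"
    "Oct 8 16:26:00 s12n01 last message repeated 4 times\n"
    "Oct 8 16:26:00 s12n01 kernel: nfsd: peername failed (err 107)!\n"
    "Oct 8 16:26:58 s12n01 kernel: nfsd: peername failed (err 107)!\n"
)

if __name__ == "__main__":
    for name, count in sorted(tally_nfsd_errors(SAMPLE).items()):
        print(name, count)
```

Both codes describe the client side of the TCP connection going away (reset, or no longer connected) before the server could reply, which fits clients timing out while the nfsd threads are stalled.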
I was seeing similar messages this morning, but those went away
when I mounted this file system on another node in the cluster,
turned on statfs_fast, and then moved the service to that node.
I'm not sure what to do about it given that gfs_grow is running.
Is this something anyone else has seen? Does anyone know what to
do about this? Do I have any option other than to wait until
gfs_grow is done? Given my recent experiences (see "lm_dlm_cancel"
in the list archives), I'm very hesitant to hit ^C on this
gfs_grow. I'm running CentOS 4 for x86-64, kernel
2.6.9-67.0.20.ELsmp.
Thanks,
James
--
Linux-cluster mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-cluster