The gfs_grow did finally complete, but now I've got another problem:
Oct 9 17:55:49 s12n03 kernel: GFS: fsid=s12:scratch13.1: fatal: invalid metadata block
Oct 9 17:55:49 s12n03 kernel: GFS: fsid=s12:scratch13.1: bh = 4314413922 (type: exp=5, found=4)
Oct 9 17:55:49 s12n03 kernel: GFS: fsid=s12:scratch13.1: function = gfs_get_meta_buffer
Oct 9 17:55:49 s12n03 kernel: GFS: fsid=s12:scratch13.1: file = /builddir/build/BUILD/gfs-kernel-2.6.9-75/smp/src/gfs/dio.c, line = 1223
Oct 9 17:55:49 s12n03 kernel: GFS: fsid=s12:scratch13.1: time = 1223589349
Oct 9 17:55:49 s12n03 kernel: GFS: fsid=s12:scratch13.1: about to withdraw from the cluster
Oct 9 17:55:49 s12n03 kernel: GFS: fsid=s12:scratch13.1: waiting for outstanding I/O
Oct 9 17:55:49 s12n03 kernel: GFS: fsid=s12:scratch13.1: telling LM to withdraw
Oct 9 17:55:50 s12n01 kernel: GFS: fsid=s12:scratch13.2: jid=1: Trying to acquire journal lock...
Oct 9 17:55:50 s12n01 kernel: GFS: fsid=s12:scratch13.2: jid=1: Busy
Oct 9 17:55:50 s12n02 kernel: GFS: fsid=s12:scratch13.0: jid=1: Trying to acquire journal lock...
Oct 9 17:55:50 s12n02 kernel: GFS: fsid=s12:scratch13.0: jid=1: Looking at journal...
Oct 9 17:55:51 s12n02 kernel: GFS: fsid=s12:scratch13.0: jid=1: Acquiring the transaction lock...
Oct 9 17:55:51 s12n02 kernel: GFS: fsid=s12:scratch13.0: jid=1: Replaying journal...
Oct 9 17:55:52 s12n02 kernel: GFS: fsid=s12:scratch13.0: jid=1: Replayed 1637 of 3945 blocks
Oct 9 17:55:52 s12n02 kernel: GFS: fsid=s12:scratch13.0: jid=1: replays = 1637, skips = 115, sames = 2193
Oct 9 17:55:52 s12n03 kernel: lock_dlm: withdraw abandoned memory
Oct 9 17:55:52 s12n02 kernel: GFS: fsid=s12:scratch13.0: jid=1: Journal replayed in 2s
Oct 9 17:55:52 s12n03 kernel: GFS: fsid=s12:scratch13.1: withdrawn
Oct 9 17:55:52 s12n02 kernel: GFS: fsid=s12:scratch13.0: jid=1: Done
Oct 9 17:56:26 s12n03 clurgmgrd: [6611]: <err> clusterfs:gfs-scratch13: Mount point is not accessible!
Oct 9 17:56:26 s12n03 clurgmgrd[6611]: <notice> status on clusterfs:gfs-scratch13 returned 1 (generic error)
Oct 9 17:56:26 s12n03 clurgmgrd[6611]: <notice> Stopping service scratch13
Oct 9 17:56:26 s12n03 clurgmgrd: [6611]: <info> Removing IPv4 address 10.14.12.5 from bond0
Oct 9 17:56:36 s12n03 clurgmgrd: [6611]: <err> /scratch13 is not a directory
Oct 9 17:56:36 s12n03 clurgmgrd[6611]: <notice> stop on nfsclient:nfs-scratch13 returned 2 (invalid argument(s))
Oct 9 17:56:36 s12n03 clurgmgrd[6611]: <crit> #12: RG scratch13 failed to stop; intervention required
Oct 9 17:56:36 s12n03 clurgmgrd[6611]: <notice> Service scratch13 is failed
The history here is that a new storage shelf was added to the
chassis. This somehow triggered an error on the chassis - a timeout
of some sort, as I understand it from Site Ops - which I presume
triggered this problem on this file system, since the two events were
coincident. I have run gfs_fsck against this file system, but it
didn't fix the problem - even when I used a newer version of gfs_fsck
from RHEL 5 that had been back-ported to RHEL4. I had done this a
couple of times before running the gfs_grow, and had hoped that the
problem had been taken care of. Apparently not. Does anyone have any
thoughts here? I can make the file system available again by killing
off anything I suspect might be accessing that invalid metadata block,
but that's not a good solution.
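For completeness, the fsck passes I ran were along these lines -- just a sketch,
and it assumes the file system can be unmounted on every node and that this
gfs_fsck build supports the usual -v and -y options:

  # Run only with the file system unmounted on *all* nodes.
  # Assumes -v (verbose) and -y (answer yes to repairs) work on this gfs_fsck build.
  umount /scratch13                     # on each node in the cluster
  gfs_fsck -v -y /dev/s12/scratch13     # full check/repair pass
  mount /scratch13                      # or let rgmanager bring the service back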
Thanks,
James
On Oct 9, 2008, at 11:18 AM, James Chamberlain wrote:
Thanks Andrew.
What I'm really hoping for is anything I can do to make this
gfs_grow go faster. It's been running for 19 hours now; I have no
idea when it'll complete, and the file system I'm trying to grow has
been all but unusable for the duration. This is a very busy file
system, and I know it's best to run gfs_grow on a quiet file system,
but there isn't too much I can do about that. Alternatively, if
anyone knows of a signal I could send to gfs_grow that would cause
it to give a status report or increase verbosity, that would be
helpful, too. I have tried both increasing and decreasing the
number of NFS threads, but since I can't tell where I am in the
process or how quickly it's going, I have no idea what effect this
has on operations.
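Short of a real progress flag, one indirect way to gauge it -- a sketch, assuming
the sysstat package (iostat) is installed, gfs_tool is in the PATH, and gfs_tool
df still responds while gfs_grow holds its locks -- would be to watch I/O on the
volume and whether the reported file system size is moving:

  # Indirect progress monitoring while gfs_grow runs (assumptions as above).
  iostat -x 10                           # is the LV behind /scratch13 actually busy?
  watch -n 60 'gfs_tool df /scratch13'   # does the reported FS size/block count change?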
Thanks,
James
On Oct 8, 2008, at 5:12 PM, Andrew A. Neuschwander wrote:
James,
I have a CentOS 5.2 cluster where I would see the same nfs errors
under certain conditions. If I did anything that introduced latency
to my gfs operations on the node that served nfs, the nfs threads
couldn't service requests faster than they came in from clients.
Eventually my nfs threads would all be busy and start dropping nfs
requests. I kept an eye on my nfsd thread utilization (/proc/net/
rpc/nfsd) and kept bumping up the number of threads until they
could handle all the requests while the gfs had a higher latency.
In my case, I had EMC Networker streaming data from my gfs
filesystems to a local scsi tape device on the same node that
served nfs. I eventually separated them onto different nodes.
I'm sure gfs_grow would slow down your gfs enough that your nfs
threads couldn't keep up. NFS on gfs seems to be very latency
sensitive. I have a quick and dirty perl script to generate a
histogram image from nfs thread stats if you are interested.
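If it's useful in the meantime, the numbers I watch are on the "th" line of
/proc/net/rpc/nfsd -- as I read it: the thread count, how many times all threads
were busy at once, and ten busy-time histogram buckets. A rough stand-in for the
script (a sketch, not the script itself) is just:

  # Watch nfsd thread saturation; the "th" line is: thread count,
  # times all threads were busy, then 10 busy-time histogram buckets.
  watch -n 5 'grep ^th /proc/net/rpc/nfsd'
  # If the "all busy" counter keeps climbing, add threads, e.g.: rpc.nfsd 128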
-Andrew
--
Andrew A. Neuschwander, RHCE
Linux Systems/Software Engineer
College of Forestry and Conservation
The University of Montana
http://www.ntsg.umt.edu
[EMAIL PROTECTED] - 406.243.6310
James Chamberlain wrote:
Hi all,
I'd like to thank Bob Peterson for helping me solve the last
problem I was seeing with my storage cluster. I've got a new one
now. A couple days ago, site ops plugged in a new storage shelf
and this triggered some sort of error in the storage chassis. I
was able to sort that out with gfs_fsck, and have since gotten the
new storage recognized by the cluster. I'd like to make use of
this new storage, and it's here that we run into trouble.
lvextend completed with no trouble, so I ran gfs_grow. gfs_grow
has been running for over an hour now and has not progressed past:
[EMAIL PROTECTED] ~]# gfs_grow /dev/s12/scratch13
FS: Mount Point: /scratch13
FS: Device: /dev/s12/scratch13
FS: Options: rw,noatime,nodiratime
FS: Size: 4392290302
DEV: Size: 5466032128
Preparing to write new FS information...
The load average on this node has risen from its normal ~30-40 to
513 (the number of nfsd threads, plus one), and the file system
has become slow-to-inaccessible on client nodes. I am seeing
messages like the following in my log files:
Oct 8 16:26:00 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket
Oct 8 16:26:00 s12n01 last message repeated 4 times
Oct 8 16:26:00 s12n01 kernel: nfsd: peername failed (err 107)!
Oct 8 16:26:58 s12n01 kernel: nfsd: peername failed (err 107)!
Oct 8 16:27:56 s12n01 last message repeated 2 times
Oct 8 16:27:56 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket
Oct 8 16:27:56 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket
Oct 8 16:27:56 s12n01 kernel: nfsd: peername failed (err 107)!
Oct 8 16:27:56 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket
Oct 8 16:27:56 s12n01 kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket
Oct 8 16:27:56 s12n01 kernel: nfsd: peername failed (err 107)!
Oct 8 16:28:34 s12n01 last message repeated 2 times
Oct 8 16:30:29 s12n01 last message repeated 2 times
I was seeing similar messages this morning, but those went away
when I mounted this file system on another node in the cluster,
turned on statfs_fast, and then moved the service to that node.
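For reference, turning statfs_fast on was roughly the following -- a sketch,
assuming this GFS version exposes the statfs_fast tunable through gfs_tool
settune:

  gfs_tool settune /scratch13 statfs_fast 1        # enable fast/cached statfs
  gfs_tool gettune /scratch13 | grep statfs_fast   # confirm the setting took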
I'm not sure what to do about it given that gfs_grow is running.
Is this something anyone else has seen? Does anyone know what to
do about this? Do I have any option other than to wait until
gfs_grow is done? Given my recent experiences (see
"lm_dlm_cancel" in the list archives), I'm very hesitant to hit ^C
on this gfs_grow. I'm running CentOS 4 for x86-64, kernel
2.6.9-67.0.20.ELsmp.
Thanks,
James
--
Linux-cluster mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-cluster