From ~an hour's googling and reading, it looks like this (not uncommon) bug/warning/error has not necessarily been associated with data loss, but we are finding that our glusterfs is interrupting our cluster jobs with 'Stale NFS file handle' warnings like this (on the client):

[2013-01-03 12:30:59.149230] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-0: remote operation failed: Stale NFS file handle. Path: /bio/krthornt/build_div/yak/line06_CY08A/prinses (3b0aa7b2-bf7f-4b27-b515-32e94b1206e3)

(and 7 more, differing by timestamps of <<1s)

The dir mentioned existed before the job was asked to read from it, and shortly after the SGE job failed I checked that the glusterfs (/bio) was still mounted and that the dir was still r/w. We are getting these errors infrequently but fairly regularly (a couple of times a week, usually during a big array job that heavily reads from a particular dir), and I haven't seen any resolution of the fault besides the error text itself being reworded (Bug 832694, quoted below). I know it's not necessarily an NFS problem, but I haven't seen a fix from the gluster folks.
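(The "NFS" in the message is a bit of a red herring; as Kaushal explains in the thread below, the text is just libc's default strerror() string for errno ESTALE, which gluster returns when a file reference has gone stale. A two-line check for the curious, as a minimal sketch; I believe newer glibc versions have since dropped the "NFS" from the wording:)

/* estale_msg.c - show where the "Stale NFS file handle" text comes
 * from: it is simply libc's default message for errno ESTALE.
 * Build with: gcc -o estale_msg estale_msg.c */
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Prints "Stale NFS file handle" on older glibc;
     * newer glibc dropped the "NFS". */
    printf("ESTALE (%d): %s\n", ESTALE, strerror(ESTALE));
    return 0;
}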
Our glusterfs on this system is set up like this (over QDR/tcpoib):

$ gluster volume info gl

Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
auth.allow: 10.2.*.*,10.1.*.*
performance.io-thread-count: 64
performance.quick-read: on
performance.io-cache: on
nfs.disable: on
performance.cache-size: 268435456
performance.flush-behind: on
performance.write-behind-window-size: 1024MB

and otherwise appears to be happy.

We were having a low-level problem with the RAID servers, where this LSI/3ware error was temporally close (~2m) to the gluster error:

LSI 3DM2 alert -- host: biostor4.oit.uci.edu
Jan 03, 2013 03:32:09PM - Controller 6 ERROR - Drive timeout detected: encl=1, slot=3

This error seemed to be related to construction around our data center and the dust that came with it. We have had 10s of these LSI/3ware errors with no related gluster errors or apparent problems with the RAIDs. No drives were ejected from the RAIDs and the errors did not repeat. 3ware explains:

<http://cholla.mmto.org/computers/3ware/3dm2/en/3DM_2_OLH-8-6.html>
==============================
009h Drive timeout detected
The 3ware RAID controller has a sophisticated recovery mechanism to handle various types of failures of a disk drive. One such possible failure of a disk drive is a failure of a command that is pending from the 3ware RAID controller to complete within a reasonable amount of time. If the 3ware RAID controller detects this condition, it notifies the user, prior to entering the recovery phase, by displaying this AEN. Possible causes of APORT time-outs include a bad or intermittent disk drive, power cable or interface cable.
================================

We've checked into this and it doesn't seem to be related, but I thought I'd bring it up.

hjm
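P.S. Since there doesn't seem to be a fix yet, the only client-side band-aid I can think of is to make the job's I/O retry on ESTALE instead of dying, on the theory that the stale reference clears once the client re-resolves the path. A rough, untested sketch (open_retry_estale is a made-up helper, not anything from gluster):

/* estale_retry.c - retry open() a few times when it fails with
 * ESTALE, instead of letting a transient stale handle kill the job.
 * Hypothetical helper; an untested sketch, not a fix. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int open_retry_estale(const char *path, int flags, int tries)
{
    while (tries-- > 0) {
        int fd = open(path, flags);
        if (fd >= 0 || errno != ESTALE)
            return fd;   /* success, or an error other than ESTALE */
        fprintf(stderr, "ESTALE on %s, retrying (%d tries left)\n",
                path, tries);
        sleep(1);        /* give the client a moment to re-resolve */
    }
    errno = ESTALE;
    return -1;
}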
On Thursday, August 23, 2012 09:54:13 PM Joe Julian wrote:
> *Bug 832694* <https://bugzilla.redhat.com/show_bug.cgi?id=832694> -
> ESTALE error text should be reworded
>
> On 08/23/2012 09:50 PM, Kaushal M wrote:
> > The "Stale NFS file handle" message is the default string given by
> > strerror() for errno ESTALE. Gluster uses ESTALE as errno to
> > indicate that the file being referred to no longer exists, ie. the
> > reference is stale.
> >
> > - Kaushal
> >
> > On Fri, Aug 24, 2012 at 7:03 AM, Jules Wang <[email protected]> wrote:
> > > Hi, Jon,
> > >
> > > I also met the same issue, and reported a bug
> > > (https://bugzilla.redhat.com/show_bug.cgi?id=851381).
> > > In the bug report, I share a simple way to reproduce the bug.
> > > Have fun.
> > >
> > > Best Regards.
> > > Jules Wang.
> > >
> > > At 2012-08-23 23:02:34, "Bùi Hùng Việt" <[email protected]> wrote:
> > > > Hi Jon,
> > > > I have no answer for you. Just want to share with you guys that
> > > > I met the same issue with this message. In my gluster system,
> > > > the Gluster client log files have a lot of these messages. I
> > > > tried to ask and found nothing on the Web. Amazingly, Gluster
> > > > has been running for a long time :)
> > > >
> > > > On Thu, Aug 23, 2012 at 8:43 PM, Jon Tegner <[email protected]> wrote:
> > > > > Hi, I'm a bit curious about error messages of the type
> > > > > "remote operation failed: Stale NFS file handle". All clients
> > > > > using the file system use the Gluster Native Client, so why
> > > > > should a stale NFS file handle be reported?
> > > > >
> > > > > Regards,
> > > > >
> > > > > /jon

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)

_______________________________________________
Gluster-users mailing list
[email protected]
http://supercolony.gluster.org/mailman/listinfo/gluster-users
