On Jun 27, 2008, at 2:22 PM, Bernd Schubert wrote:

> On Fri, Jun 27, 2008 at 01:44:13PM -0400, Brock Palen wrote:
>> On Jun 27, 2008, at 1:39 PM, Bernd Schubert wrote:
>>> On Fri, Jun 27, 2008 at 01:07:32PM -0400, Brian J. Murrell wrote:
>>>> On Fri, 2008-06-27 at 12:44 -0400, Brock Palen wrote:
>>>>>
>>>>> All of them are stuck in un-interruptible sleep.
>>>>> Has anyone seen this happen before? Is this caused by a pending
>>>>> disk failure?
>>>>
>>>> Well, they are certainly stuck because of some blocking I/O. That
>>>> could be disk failure, indeed.
>>>>
>>>>> mptscsi: ioc1: attempting task abort! (sc=0000010038904c40)
>>>>> scsi1 : destination target 0, lun 0
>>>>> command = Read (10) 00 75 94 40 00 00 10 00 00
>>>>> mptscsi: ioc1: task abort: SUCCESS (sc=0000010038904c40)
>>>>
>>>> That does not look like a picture of happiness, indeed, no. You
>>>> have SCSI commands aborting.
>>>>
>>>
>>> Well, these messages are not nice of course, since the mpt error
>>> handler got activated, but in principle a scsi device can recover
>>> then. Unfortunately, the verbosity level of scsi makes it impossible
>>> to figure out what was actually the problem. Since we suffered from
>>> severe scsi problems, I wrote quite a number of patches to improve
>>> the situation.
>>> We now at least can understand where the problem came from and also
>>> have a slightly improved error handling. These are presently for
>>> 2.6.22 only, but my plan is to send these upstream for 2.6.28.
>>>
>>>
>>>>> Lustre: 6698:0:(lustre_fsfilt.h:306:fsfilt_setattr()) nobackup-OST0001: slow setattr 100s
>>>>> Lustre: 6698:0:(watchdog.c:312:lcw_update_time()) Expired watchdog for pid 6698 disabled after 103.1261s
>>>>
>>>> Those are just fallout from the above disk situation.
>>>
>>> Probably the device was offlined and actually this also should have
>>> been printed in the logs. Brock, can you check the device status
>>> (cat /sys/block/sdX/device/state).
>>
>> IO is still flowing from both OSTs on that OSS,
>>
>> [EMAIL PROTECTED] ~]# cat /sys/block/sd*/device/state
>> running
>> running
>
> So the device recovered. Is this parallel SCSI? If so it now might
> run at a lower scsi speed level, but you should have got domain
> validation messages about this (unless you are using a customized
> driver, which has DV disabled).
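For reference, the state check quoted above can be scripted when there are more devices to look at. This is only a rough sketch and not something from the thread: the function name scsi_device_states and its sys_block parameter are made up for illustration, and it assumes the same sd*/device/state sysfs layout that the cat command above reads.

```python
import glob
import os


def scsi_device_states(sys_block="/sys/block"):
    """Return {device_name: state} for every sd* entry under sys_block.

    Hypothetical helper: it reads the same files as
    `cat /sys/block/sd*/device/state`.
    """
    states = {}
    pattern = os.path.join(sys_block, "sd*", "device", "state")
    for path in sorted(glob.glob(pattern)):
        # path looks like <sys_block>/sdX/device/state, so go up two levels
        name = os.path.basename(os.path.dirname(os.path.dirname(path)))
        with open(path) as fh:
            states[name] = fh.read().strip()
    return states


if __name__ == "__main__":
    for name, state in scsi_device_states().items():
        # Flag anything that is not reported as "running"
        flag = "" if state == "running" else "  <-- not running"
        print(f"{name}: {state}{flag}")
```

Run on the OSS itself, it should print one line per sd device. Anything other than "running" would be worth a closer look in the logs.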
It's Fibre Channel for the medium, direct connected (no loop or
switch), so I am not sure. The driver is the stock one with RHEL4.

>
>
> Cheers,
> Bernd
>
>

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss