On Fri, Jun 27, 2008 at 02:29:24PM -0400, Brock Palen wrote:
> On Jun 27, 2008, at 2:22 PM, Bernd Schubert wrote:
>> On Fri, Jun 27, 2008 at 01:44:13PM -0400, Brock Palen wrote:
>>> On Jun 27, 2008, at 1:39 PM, Bernd Schubert wrote:
>>>> On Fri, Jun 27, 2008 at 01:07:32PM -0400, Brian J. Murrell wrote:
>>>>> On Fri, 2008-06-27 at 12:44 -0400, Brock Palen wrote:
>>>>>>
>>>>>> All of them are stuck in uninterruptible sleep.
>>>>>> Has anyone seen this happen before? Is this caused by a pending
>>>>>> disk failure?
>>>>>
>>>>> Well, they are certainly stuck because of some blocking I/O. That
>>>>> could be disk failure, indeed.
>>>>>
>>>>>> mptscsi: ioc1: attempting task abort! (sc=0000010038904c40)
>>>>>> scsi1 : destination target 0, lun 0
>>>>>> command = Read (10) 00 75 94 40 00 00 10 00 00
>>>>>> mptscsi: ioc1: task abort: SUCCESS (sc=0000010038904c40)
>>>>>
>>>>> That does not look like a picture of happiness, indeed, no. You
>>>>> have SCSI commands aborting.
>>>>>
>>>>
>>>> Well, these messages are not nice of course, since the mpt error
>>>> handler got activated, but in principle a SCSI device can recover
>>>> then. Unfortunately, the verbosity level of SCSI makes it impossible
>>>> to figure out what was actually the problem. Since we suffered from
>>>> severe SCSI problems, I wrote quite a number of patches to improve
>>>> the situation. We can now at least understand where the problem came
>>>> from and also have slightly improved error handling. These are
>>>> presently for 2.6.22 only, but my plan is to send these upstream
>>>> for 2.6.28.
>>>>
>>>>
>>>>>> Lustre: 6698:0:(lustre_fsfilt.h:306:fsfilt_setattr()) nobackup-OST0001: slow setattr 100s
>>>>>> Lustre: 6698:0:(watchdog.c:312:lcw_update_time()) Expired watchdog for pid 6698 disabled after 103.1261s
>>>>>
>>>>> Those are just fallout from the above disk situation.
>>>>
>>>> Probably the device was offlined, and this should also have been
>>>> printed in the logs. Brock, can you check the device status
>>>> (cat /sys/block/sdX/device/state)?
>>>
>>> I/O is still flowing from both OSTs on that OSS,
>>>
>>> [EMAIL PROTECTED] ~]# cat /sys/block/sd*/device/state
>>> running
>>> running
>>
>> So the device recovered. Is this parallel SCSI? If so, it might now
>> run at a lower SCSI speed level, but you should have gotten domain
>> validation messages about this (unless you are using a customized
>> driver which has DV disabled).
>
> It's Fibre Channel for the medium. Direct connected (no loop or switch).
> So I am not sure; the driver is the stock one with RHEL4.
>
Ok, quite different then. I only have very little experience with FC, so
no idea what's wrong with your system now.

Cheers,
Bernd
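
For reference, the device-state check quoted above (cat /sys/block/sdX/device/state) can be scripted so that any device not reporting "running" stands out. What follows is only a minimal sketch in Python, assuming the same Linux sysfs layout as the quoted command. The function name `read_device_states` and its `sys_block` and `pattern` parameters are made up for illustration and are not part of Lustre or any driver.

```python
import glob
import os


def read_device_states(sys_block="/sys/block", pattern="sd*"):
    """Return {device_name: state} for each matching block device.

    Hypothetical helper: it reads the same files that
    `cat /sys/block/sd*/device/state` prints.
    """
    states = {}
    state_glob = os.path.join(sys_block, pattern, "device", "state")
    for path in sorted(glob.glob(state_glob)):
        # path looks like <sys_block>/sda/device/state; go up two levels
        name = os.path.basename(os.path.dirname(os.path.dirname(path)))
        try:
            with open(path) as fh:
                states[name] = fh.read().strip()
        except OSError as exc:
            # Keep going so one unreadable entry does not hide the others
            states[name] = "unreadable: %s" % exc
    return states


if __name__ == "__main__":
    for name, state in read_device_states().items():
        # "running" is the healthy state seen in the thread; flag the rest
        flag = "" if state == "running" else "  <-- check this device"
        print("%s: %s%s" % (name, state, flag))
```

As in the thread, a "running" state only says the device is currently online. It does not show whether the error handler had to recover it earlier, so the kernel log still needs to be read alongside it.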
