On Tue, 20 Feb 2007 12:48:12 +0800, Marc Marais wrote
> On Mon, 19 Feb 2007 11:26:24 +0100 (MET), Mikael Pettersson wrote
> > On Mon, 19 Feb 2007 12:43:50 +0800, Marc Marais wrote:
> > > I've decided to post this to the linux-ide list to see if I can get
> > > to the bottom of this problem I'm experiencing with sata_promise and
> > > my PATA drives.
> > >
> > > I've pasted a thread from the linux-raid list where I was trying to
> > > troubleshoot/recover a destroyed raid5 array.
> > >
> > > First a full history:
> > >
> > > 1) 2.6.17.13: 3 drive PATA raid5 array with one drive starting to give
> > > read errors (legitimate according to SMART logs).
> > > 2) System lockups (no kernel panic seen) during load - I suspect due
> > > to the read error on the failing drive.
> > > 3) Decide to upgrade to 2.6.20
> > > 4) Raid5 issues occur (handling of read errors caused md device to
> > > die).
> > > 5) Patch from Neil to fix raid-5 error handling
> > > 6) Replace failed drive and add a new drive at the same time to create
> > > a 4 drive PATA array.
> > > 7) Attempt to grow the array from 3 -> 4 devices which failed due to
> > > an error similar to this:
> > >
> > > ata3: command timeout
> > > ata3: no sense translation for status: 0x40
> > > ata3: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
> > > ata4: status=0x40 { DriveReady }
> > > sd 3:0:0:0: SCSI error: return code = 0x08000002
> > > sdd: Current [descriptor]: sense key: Aborted Command
> > > Additional sense: No additional sense information
> > > Descriptor sense data with sense descriptors (in hex):
> > > 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
> > > 00 00 00 00
> > > end_request: I/O error, dev sdc, sector 260419647
> > >
> > > 8) Raid array is trashed, rebuild array and restore from backup.
> > > 9) From this point on the system is up and running - restored to
> > > working state. However, I'm still getting errors similar to the above
> > > during array accesses (read/write). Not related to load. The array
> > > (being synced) manages to continue operation using another drive. My
> > > concern is that this may happen on a degraded array in future.
> > >
> > > Note that the error I'm getting (shown above) has happened on sdc and
> > > sdd and at different sectors (i.e. not a consistent read error). Also,
> > > the SMART logs for both drives show NO error at all, short and long
> > > SMART tests complete successfully. I suspect this is an issue in the
> > > driver and/or my physical TX4000 card.
> >
> > In the 2.6.20 kernel, 20619/TX4000 is still using the same driver
> > code and (old) error handling code it's been using for ages,
> > i.e., any 20619/TX4000 issues are unrelated to the SATAII and
> > new EH changes that I've done. Therefore I strongly suspect
> > either an old driver bug, or some hardware issue.
> >
> > From your dmesg log it seems you have at least 7 disks and a DVD
> > drive on two different controllers, an unused AIC7XXX, and an e1000
> > NIC, on a mainboard with a pair of Athlon MPs and 2GB RAM. All that
> > screams "power consumption" and "heat generation". Please make
> > absolutely sure that the PSU and cooling solutions are up to the job.
> > It doesn't hurt to check the cables and that the card is properly
> > seated as well. I'm assuming each drive is jumpered as master and
> > is connected at the far end of its cable?
>
> I have been running this server for several years now in the same
> configuration. I was originally running 4 80G drives and the only
> difference now is they have been upgraded to 4 160G drives. The system
> is very well cooled (CM Stacker case) and has a decent power supply
> which has been running it for some time now.
>
> However, I did reseat all cables and cards and also switched the IDE
> channels around on the TX4000 card. I haven't had an error yet but,
> like I mentioned, they are intermittent.
>
> > It would be very useful if you could move the drives around,
> > so the sdc/sdd drives that experienced errors are moved to the
> > ports now used by sda/sdb. That should tell us if the errors
> > are tied to the drives or the ports.
>
> I will keep monitoring and check if the errors occur on the sda/sdb drives
> since moving the drives around.
>
> Also, I saw a post on linux-kernel regarding another user seeing
> these 'command timeouts' (is that what they are?). If nothing can be
> done to prevent occasional timeouts then at least they need to be
> handled properly by retrying or whatever is best (I don't claim
> to have much inside knowledge of the kernel so have no idea how
> libata handles errors). In my case, the md layer was seeing the
> error and getting the data off another drive in the array which
> could potentially cause a problem if an array is already degraded when
> this happens.
>
> Oh, and the aic7xxx card IS being used - by an AIC tape drive ;)
>
> > /Mikael
> > -
>
> Thanks.
>
> Regards,
> Marc
> --
Replying to myself :)
Just an update. After switching the channels around I got some command
timeouts on drives sda and sdb, which implies a problem with the drives.
However, while examining the system I noticed the 6 pin aux power connector
on the motherboard was loose - I'm not sure what effect that had but I
noticed some MCE messages in the log (non-fatal correctable incident
occurred on CPU x) before the system hung (which I think are ECC memory
errors?).
If I get more timeouts I'm going to replace the power supply.
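In case it helps anyone tracking the same thing, here is a rough sketch of
how the timeouts could be tallied per port from the kernel log. It assumes
the messages look exactly like the ones quoted above ("ataN: command
timeout" and "end_request: I/O error, dev ..., sector ..."); the function
name summarize is made up for illustration and other kernel versions may
word these messages differently.

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted earlier in this thread.
# Assumption: other kernels/drivers may format these messages differently.
TIMEOUT_RE = re.compile(r"\b(ata\d+): command timeout")
IO_ERROR_RE = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def summarize(lines):
    """Count command timeouts per ata port and collect failing sectors per device."""
    timeouts = Counter()
    io_errors = {}
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts[match.group(1)] += 1
            continue
        match = IO_ERROR_RE.search(line)
        if match:
            # Keep every sector so repeated errors at one location stand out.
            io_errors.setdefault(match.group(1), []).append(int(match.group(2)))
    return timeouts, io_errors
```

Feeding it the saved dmesg output line by line gives a count per port and
the list of failing sectors per device, which should show whether the
errors stay with a port or follow a drive after the cables are swapped.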
Anyway, sorry to burden the list with my problems. If you can take anything
from this to improve the kernel/libata/sata_promise then at least I've made
a contribution. Thanks for your time.
Regards,
Marc
--