Re: [zfs-discuss] Resilver restarting several times

Jim Klimov Sat, 12 May 2012 07:57:48 -0700

Thanks for staying tuned! ;)

2012-05-12 18:34, Richard Elling wrote:

On May 12, 2012, at 4:52 AM, Jim Klimov wrote:

2012-05-11 14:22, Jim Klimov wrote:

What conditions can cause the reset of the resilvering
process? My lost-and-found disk can't get back into the
pool because of resilvers restarting...


FOLLOW-UP AND NEW QUESTIONS

Here is a new piece of evidence: I've finally got something
out of fmdump, a series of five retries ending in a failure,
logged 75 seconds before the resilver restarted (more below).
Not a squeak in zpool status, dmesg or /dev/console.
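
For anyone wanting to check a correlation like this across several
restarts, here is a minimal sketch in Python. It assumes the event
times and the resilver start times have already been collected by
hand into datetime values; the function name, the 5-minute window
and the sample timestamps are all hypothetical and only illustrate
the 75-second gap described above.

```python
from datetime import datetime, timedelta


def preceding_event_gaps(event_times, restart_times,
                         window=timedelta(minutes=5)):
    """For each resilver restart, report the gap in seconds to the
    latest error event that happened at or before it, but only if
    that event falls inside the given window; otherwise None."""
    events = sorted(event_times)
    results = []
    for restart in sorted(restart_times):
        prior = [e for e in events
                 if e <= restart and restart - e <= window]
        gap = (restart - prior[-1]).total_seconds() if prior else None
        results.append((restart, gap))
    return results


# Hypothetical timestamps, not taken from the real logs.
events = [datetime(2012, 5, 12, 14, 0, 0)]
restarts = [datetime(2012, 5, 12, 14, 1, 15)]
gaps = preceding_event_gaps(events, restarts)
for restart, gap in gaps:
    print(restart.isoformat(), gap)
```

A restart with no error event inside the window gets a gap of None,
which would point to some other cause for that restart.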

Guess I must assume that the disk is dying indeed, losing
connection or something like that after a random time (my
resilvers restart after 15min-5hrs), and at least a run of
SMART long diags is in order, while the pool would try to
rebuild onto another disk (the hotspare) instead of trying
to update this one which was in the pool.


Please share if SMART offers anything useful.


I plan to run the tests after (if, when) the resilver completes,
so as not to disturb the system. So far I got smartmontools-5.42
compiled, and it sees the disks :)

I am not sure whether it would actually run the self-tests and
report back - this seems not to be supported by the disk(?):


# /usr/local/sbin/smartctl -x /dev/rdsk/c1t2d0 -d scsi
smartctl 5.42 2011-10-20 r3458 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

User Capacity:        250,056,000,000 bytes [250 GB]
Logical block size:   512 bytes
Serial number:                    5QE5ADXW
Device type:          disk
Local Time is:        Sat May 12 18:40:53 2012 MSK
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature:     26 C

Error Counter logging not supported
No self-tests have been logged
Device does not support Background scan results logging
scsiPrintSasPhy Log Sense Failed [unsupported field in scsi command]
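
To compare several disks quickly, output like the above can be
reduced to a short summary. The following is only a sketch: the
parser and its names are hypothetical, it relies on nothing but
the "key: value" layout visible in this one listing, and the
sample text is copied from the output above.

```python
import re

# Body of the smartctl output quoted above (banner lines omitted).
SAMPLE = """\
User Capacity:        250,056,000,000 bytes [250 GB]
Logical block size:   512 bytes
Serial number:                    5QE5ADXW
Device type:          disk
Local Time is:        Sat May 12 18:40:53 2012 MSK
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature:     26 C

Error Counter logging not supported
No self-tests have been logged
Device does not support Background scan results logging
scsiPrintSasPhy Log Sense Failed [unsupported field in scsi command]
"""


def summarize_smartctl(text):
    """Split the listing into "key: value" fields and free-form
    notes, then pick out the health status and every note that
    says a feature is unsupported."""
    fields, notes = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = re.match(r"^([^:]+):\s+(.*)$", line)
        if match:
            fields[match.group(1)] = match.group(2)
        else:
            notes.append(line)
    unsupported = [n for n in notes
                   if "not support" in n.lower()
                   or "unsupported" in n.lower()]
    return {"fields": fields,
            "health": fields.get("SMART Health Status"),
            "unsupported": unsupported}


report = summarize_smartctl(SAMPLE)
print("health:", report["health"])
for note in report["unsupported"]:
    print("unsupported:", note)
```

The summary makes the problem plain: the health status is OK, but
most of the logging that would back it up is reported as
unsupported through this interface.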

I also guess that the disk gets found after something like
an unlogged bus reset or whatever, and this event causes
the resilvering to restart from scratch.


This makes sense.


Good, thanks for the sanity-check ;)

The best course of action would be to get those people to fully
replace the untrustworthy disk... Or at least pull and push
it a bit - maybe its contacts just got plain dirty/oxidized
and the disk should be re-seated in the enclosure...


Not likely to help. SATA?


Yes, a 250GB Seagate SATA disk on a Thumper X4500, in use since about
May 2008 (the last rpool installation date) or maybe earlier.

      10. c1t2d0 <ATA-SEAGATE ST32500N-3AZQ-232.88GB>
          /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0

Thanks,
//Jim
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
