Hi.

Running snv_104 x86 on some very generic hardware as a testbed for some 
fun projects and as a home fileserver. Rough specifications of the host:

* Intel Q6600
* 6GB DDR2
* Multiple 250GB and 500GB SATA-connected HDDs from mixed vendors
* Gigabyte GA-DQ6 series motherboard
* etc.

The problem, or interesting scenario:

I decided to cron a zpool scrub of all three zpools on the host system 
simultaneously. Something like this:

00 01 * * * /usr/sbin/zpool scrub backups > /dev/null 2>&1
00 01 * * * /usr/sbin/zpool scrub ztank > /dev/null 2>&1
00 01 * * * /usr/sbin/zpool scrub zebraware_root > /dev/null 2>&1

So, we know from reading documentation that:

[i]Because scrubbing and resilvering are I/O-intensive operations, ZFS only 
allows one at a time. If a scrub is already in progress, the "zpool scrub" 
command terminates it and starts a new scrub. If a resilver is in progress, 
ZFS does not allow a scrub to be started until the resilver completes.[/i]

Please note the "[b]ZFS only allows one at a time[/b]" statement. Maybe 
relevant to what I'm about to explain. Maybe not.

I've noticed that when I lay my cron out this way, two things happen:

1. On the "backups" pool, which is a simple striped zpool with no redundancy, 
mirroring or anything else of use, the pool will fault at some indeterminate 
point during the scrub operation.

2. The same thing will ALSO occur on the root pool (zebraware_root).

However, if the scrubs are cron'ed at DIFFERENT times, leaving enough of a gap 
for each to complete before the next starts, these errors never appear in 
/var/adm/messages, and a "zpool status -x" reports all pools as healthy. It is 
only when the pools are cron'ed to scrub simultaneously that the read errors 
occur. Some interesting output, appearing just after the simultaneous scrub 
starts on the three pools that exist on the host:

Dec 30 06:37:22 rapoosev5 scsi: [ID 107833 kern.warning] WARNING: 
/p...@0,0/pci8086,2...@1c,4/pci-...@0/i...@0 (ata0):
Dec 30 06:37:22 rapoosev5       timeout: abort request, target=0 lun=0
Dec 30 06:37:22 rapoosev5 scsi: [ID 107833 kern.warning] WARNING: 
/p...@0,0/pci8086,2...@1c,4/pci-...@0/i...@0 (ata0):
Dec 30 06:37:22 rapoosev5       timeout: abort device, target=0 lun=0
Dec 30 06:37:22 rapoosev5 scsi: [ID 107833 kern.warning] WARNING: 
/p...@0,0/pci8086,2...@1c,4/pci-...@0/i...@0 (ata0):
Dec 30 06:37:22 rapoosev5       timeout: reset target, target=0 lun=0
Dec 30 06:37:22 rapoosev5 scsi: [ID 107833 kern.warning] WARNING: 
/p...@0,0/pci8086,2...@1c,4/pci-...@0/i...@0 (ata0):
Dec 30 06:37:22 rapoosev5       timeout: reset bus, target=0 lun=0
Dec 30 06:37:22 rapoosev5 scsi: [ID 107833 kern.warning] WARNING: 
/p...@0,0/pci8086,2...@1c,4/pci-...@0/i...@0 (ata0):
Dec 30 06:37:22 rapoosev5       timeout: early timeout, target=1 lun=0
Dec 30 06:37:22 rapoosev5 scsi: [ID 107833 kern.warning] WARNING: 
/p...@0,0/pci8086,2...@1c,4/pci-...@0/i...@0 (ata0):
Dec 30 06:37:22 rapoosev5       timeout: early timeout, target=0 lun=0
Dec 30 06:37:22 rapoosev5 gda: [ID 107833 kern.warning] WARNING: 
/p...@0,0/pci8086,2...@1c,4/pci-...@0/i...@0/c...@0,0 (Disk0):
Dec 30 06:37:22 rapoosev5       Error for command 'read sector' Error Level: 
Informational
Dec 30 06:37:22 rapoosev5 gda: [ID 107833 kern.notice]  Sense Key: aborted 
command
Dec 30 06:37:22 rapoosev5 gda: [ID 107833 kern.notice]  Vendor 'Gen-ATA ' error 
code: 0x3
Dec 30 06:37:22 rapoosev5 gda: [ID 107833 kern.warning] WARNING: 
/p...@0,0/pci8086,2...@1c,4/pci-...@0/i...@0/c...@1,0 (Disk1):
Dec 30 06:37:22 rapoosev5       Error for command 'read sector' Error Level: 
Informational
Dec 30 06:37:22 rapoosev5 gda: [ID 107833 kern.notice]  Sense Key: aborted 
command
Dec 30 06:37:22 rapoosev5 gda: [ID 107833 kern.notice]  Vendor 'Gen-ATA ' error 
code: 0x3
Dec 30 06:37:22 rapoosev5 gda: [ID 107833 kern.warning] WARNING: 
/p...@0,0/pci8086,2...@1c,4/pci-...@0/i...@0/c...@0,0 (Disk0):
Dec 30 06:37:22 rapoosev5       Error for command 'read sector' Error Level: 
Informational
Dec 30 06:37:22 rapoosev5 gda: [ID 107833 kern.notice]  Sense Key: aborted 
command
Dec 30 06:37:22 rapoosev5 gda: [ID 107833 kern.notice]  Vendor 'Gen-ATA ' error 
code: 0x3

Shortly after this, we'll see:

Jan  1 06:39:58 rapoosev5 fmd: [ID 441519 daemon.error] SUNW-MSG-ID: 
ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
Jan  1 06:39:58 rapoosev5 EVENT-TIME: Thu Jan  1 06:39:56 EST 2009
Jan  1 06:39:58 rapoosev5 PLATFORM: P35-DQ6, CSN:  , HOSTNAME: rapoosev5
Jan  1 06:39:58 rapoosev5 SOURCE: zfs-diagnosis, REV: 1.0
Jan  1 06:39:58 rapoosev5 EVENT-ID: e6d95684-5ec0-4897-d761-b7e16ed40f2c
Jan  1 06:39:58 rapoosev5 DESC: The number of I/O errors associated with a ZFS 
device exceeded
Jan  1 06:39:58 rapoosev5            acceptable levels.  Refer to 
http://sun.com/msg/ZFS-8000-FD for more information.

And bang. Part of a pool is taken offline. We all know where that ends up. At 
this point, I can issue a "zpool clear" against the pools in question, and the 
pool clears and comes back online without any issues at all. Further to this, 
it is only ever READ errors that show up in the "zpool status" output. Never 
write errors, nor checksum validation problems.
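
For reference, the recovery amounts to nothing more than this (taking the 
"backups" pool as the example):

# zpool clear backups
# zpool status -x
all pools are healthy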

So, my puzzling thoughts:

1. Am I just running into I/O limitations of crappy consumer-grade controllers, 
i.e. the controllers on this consumer-grade kit simply aren't up to the task of 
handling multiple scrubs running against different pools at the same time?

2. Is this natural and to be expected (and moreover, am I breaking the rules by 
attempting to scrub more than one pool at once) - ergo, [i]"well, what did you 
expect?"[/i]
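
If the answer to 2. is yes, the obvious workaround would be to serialize the 
scrubs rather than stagger them by guesswork. An untested sketch of what I have 
in mind (the script path and polling interval are arbitrary, and it assumes 
"zpool status" reports "scrub in progress" while a scrub is running):

#!/bin/sh
# Scrub the pools one at a time: start a scrub, poll until it
# finishes, then move on to the next pool.
for pool in backups ztank zebraware_root; do
    /usr/sbin/zpool scrub "$pool"
    while /usr/sbin/zpool status "$pool" | grep "scrub in progress" > /dev/null
    do
        sleep 300
    done
done

called from a single cron entry instead of the three above:

00 01 * * * /root/scrub-serial.sh > /dev/null 2>&1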

Out of fear and sensibility, I've never simultaneously scrubbed production 
pools on our 6 series arrays at work, or for anything that actually matters - 
but I am interested in getting to the bottom of this, all the same.

Thanks!

z
-- 
This message posted from opensolaris.org