Robert White posted on Wed, 19 Nov 2014 13:05:13 -0800 as excerpted:

> One of the reasons that the whole industry has started favoring
> point-to-point (SATA, SAS) or physical intercessor chaining
> point-to-point (eSATA) buses is to remove a lot of those wait-and-see
> delays.
> 
> That said, you should not see a drive (or target enclosure, or
> controller) "reset" during spin up. In a SCSI setting this is almost
> always a cabling, termination, or addressing issue. In IDE it's a jumper
> mismatch (master vs slave vs cable-select). Less often it's a
> partitioning issue (trying to access sectors beyond the end of the
> drive).
> 
> Another strong actor is selecting the wrong storage controller chipset
> driver. In that case you may be falling back from the high-end device you
> think it is, through an intermediate chipset, and back to ACPI or BIOS
> emulation.

FWIW I run a custom-built monolithic kernel, with only the specific 
drivers (SATA/AHCI in this case) built in.  There are no drivers for 
anything else it could fall back to.

Once in a while I do see a link try at, say, 6 Gbit/s, then eventually 
fall back to 3 and ultimately 1.5, but that /is/ indicative of other 
issues when I see it.  And like I said, there are no other drivers to 
fall back to, so obviously I never see that sort of driver fallback.
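
For anyone wanting to check what speed a link actually negotiated, here's 
a minimal Python sketch reading the libata sysfs nodes (assuming the 
usual /sys/class/ata_link layout; exact paths can vary by kernel):

    #!/usr/bin/env python3
    # Sketch: report the negotiated SATA link speed per port via the
    # libata sysfs nodes.  Assumes /sys/class/ata_link/*/sata_spd exists.
    import glob, os

    for link in sorted(glob.glob("/sys/class/ata_link/link*")):
        spd_file = os.path.join(link, "sata_spd")
        try:
            with open(spd_file) as f:
                spd = f.read().strip()
        except OSError:
            continue
        # Typical values: "6.0 Gbps", "3.0 Gbps", "1.5 Gbps", "<unknown>"
        print(f"{os.path.basename(link)}: {spd}")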

> Another common cause is having a dedicated hardware RAID controller
> (Dell likes to put LSI MegaRAID controllers in their boxes, for example);
> many motherboards have hardware RAID support available through the
> BIOS, etc. Leaving that feature active, then adding a drive and
> _not_ initializing that drive with the RAID controller disk setup. In
> this case the controller is going to repeatedly probe the drive for its
> proprietary controller signature blocks (and reset the drive after each
> attempt) and then finally fall back to raw block pass-through. This can
> take a long time (thirty seconds to a minute).

Everything's set to JBOD here.  I don't trust those proprietary 
"firmware" raid things.  Besides, that kills portability.  JBOD SATA and 
AHCI are sufficiently standardized that should the hardware die, I can 
switch to something else and not have to worry about rebuilding the 
custom kernel with new drivers.  Some proprietary firmware raid that 
needs dmraid support in the kernel, when I can just as easily run full 
software mdraid on standardized JBOD?  No thanks!

And be sure, that's one of the first things I check when I set up a new 
box: any so-called hardware raid that's actually firmware/software raid 
gets disabled, and JBOD mode enabled.
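
If you want to double-check that a drive pulled from such a box isn't 
still carrying firmware-raid metadata, a rough Python sketch wrapping 
mdadm --examine (device names are illustrative, and the output and exit 
codes vary a bit by mdadm version):

    #!/usr/bin/env python3
    # Sketch: scan a few block devices for leftover RAID superblocks
    # (mdraid native, Intel IMSM, DDF) by wrapping mdadm --examine.
    # Run as root; the device list is illustrative only.
    import subprocess

    DEVICES = ["/dev/sda", "/dev/sdb"]  # adjust to your box

    for dev in DEVICES:
        result = subprocess.run(["mdadm", "--examine", dev],
                                capture_output=True, text=True)
        if result.returncode == 0:
            print(f"{dev}: RAID metadata found:")
            print(result.stdout, end="")
        else:
            print(f"{dev}: no recognizable RAID superblock")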

> But seriously, if you are seeing "reset" anywhere in any storage chain
> during a normal power-on cycle then you've got a problem with geometry
> or configuration.

IIRC I don't get it routinely.  But I've seen it a few times, 
attributing it, as I said, to the 30-second SATA-level command timeout 
not being long enough.
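
For reference, the timeout I mean is the per-device command timer the 
kernel exposes in sysfs; here's a hedged sketch for raising it (the 
30-second default and the /sys/block/sdX/device/timeout path are what I 
see here, so adjust as needed):

    #!/usr/bin/env python3
    # Sketch: bump the per-device SCSI command timeout (in seconds) so a
    # slow-to-spin-up drive isn't reset mid-probe.  Default is usually 30.
    # Needs root to write the sysfs files.
    import glob

    NEW_TIMEOUT = "120"  # illustrative value

    for path in sorted(glob.glob("/sys/block/sd*/device/timeout")):
        with open(path) as f:
            old = f.read().strip()
        with open(path, "w") as f:
            f.write(NEW_TIMEOUT)
        dev = path.split("/")[3]  # e.g. "sda"
        print(f"{dev}: timeout {old} -> {NEW_TIMEOUT}")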

Most often, however, it's at resume, not original startup, which is 
understandable as the state at resume doesn't match the state at suspend/
hibernate.  The irritating thing, as previously discussed, is when one 
device takes long enough to come back that mdraid or btrfs drops it, 
generally forcing the reboot I was trying to avoid with the suspend/
hibernate in the first place, along with a re-add and resync (for mdraid) 
or a scrub (for btrfs raid).
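
For the record, those recovery steps look roughly like this sketch (the 
array, member, and mountpoint names are placeholders, and whether 
--re-add avoids a full resync depends on having a write-intent bitmap):

    #!/usr/bin/env python3
    # Sketch: recovery after a device dropped out on resume.
    # Array, member, and mountpoint names are placeholders.
    import subprocess

    def readd_md_member(array="/dev/md0", member="/dev/sdb1"):
        # With a write-intent bitmap this is usually a quick catch-up resync.
        subprocess.run(["mdadm", array, "--re-add", member], check=True)

    def scrub_btrfs(mountpoint="/mnt/btrfs"):
        # Verifies checksums; on btrfs raid1/10 it repairs from the good copy.
        subprocess.run(["btrfs", "scrub", "start", mountpoint], check=True)

    if __name__ == "__main__":
        readd_md_member()
        scrub_btrfs()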

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
