Re: [OmniOS-discuss] disk failure causing reboot?

Rune Tipsmark Tue, 19 May 2015 02:22:50 -0700

I had the same issue about two months ago when an L2ARC device failed.
failmode was at its default, and the device was an mSATA SSD mounted in a
PCIe mSATA card:

http://www.addonics.com/products/ad4mspx2.php

The disk was one of four of these:
http://www.samsung.com/us/computer/memory-storage/MZ-MTE1T0BW

Can these reboots be avoided in any way?

Br,
Rune


From: OmniOS-discuss [mailto:[email protected]] On Behalf 
Of Schweiss, Chip
Sent: Monday, May 18, 2015 10:31 PM
To: Paul B. Henson
Cc: omnios-discuss
Subject: Re: [OmniOS-discuss] disk failure causing reboot?

I had the exact same failure mode last week.  With over 1000 spindles I see 
this about once a month.

I can publish my dump also if anyone actually wants to try to fix this
problem, but I think several dumps of the same thing are already linked to
tickets in Illumos-gate.

Pools for the most part should be set to failmode=panic or wait, but a failed
disk should not cause a panic. On the system where this happened to me,
failmode was set to wait. It is also on r151012, waiting on a window to
upgrade to r151014. My pool is raidz3, so there is no reason not to kick a
bad disk.

All my disks are SAS in DataON JBODs, dual connected across two LSI HBAs.
BTW, pull a SAS cable and you get a panic too, not degraded multipath.
Illumos seems to panic on just about any SAS event these days regardless of
redundancy.
-Chip

On Mon, May 18, 2015 at 3:08 PM, Paul B. Henson <[email protected]> wrote:
On Mon, May 18, 2015 at 06:25:34PM +0000, Jeff Stockett wrote:
> A drive failed in one of our supermicro 5048R-E1CR36L servers running
> omnios r151012 last night, and somewhat unexpectedly, the whole system
> seems to have panicked.

You don't happen to have failmode set to panic on the pool?

From the zpool manpage:

       failmode=wait | continue | panic
           Controls the system behavior in the event of catastrophic pool
           failure. This condition is typically a result of a loss of
           connectivity to the underlying storage device(s) or a failure of
           all devices within the pool. The behavior of such an event is
           determined as follows:

           wait
                       Blocks all I/O access until the device connectivity is
                       recovered and the errors are cleared. This is the
                       default behavior.

           continue
                       Returns EIO to any new write I/O requests but allows
                       reads to any of the remaining healthy devices. Any
                       write requests that have yet to be committed to disk
                       would be blocked.

           panic
                       Prints out a message to the console and generates a
                       system crash dump.
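
For anyone wanting to check this on their own system, here is a minimal sketch
of reading a pool's failmode and mapping it to the behaviors quoted above. The
pool name "tank" and the helper describe_failmode are placeholders made up for
illustration, not anything from the reports in this thread. As noted elsewhere
in the thread, a panic was also seen with failmode=wait, so this only confirms
the setting and is not a fix.

```shell
#!/bin/sh
# Sketch only: "tank" is a hypothetical pool name; pass your own as $1.
POOL="${1:-tank}"

# Hypothetical helper: summarize a failmode value per the manpage excerpt.
describe_failmode() {
    case "$1" in
        wait)     echo "wait: blocks all I/O until the device is recovered" ;;
        continue) echo "continue: EIO on new writes, reads still allowed" ;;
        panic)    echo "panic: console message and system crash dump" ;;
        *)        echo "unknown failmode: $1"; return 1 ;;
    esac
}

# Only query the pool if the zpool command is available on this host.
if command -v zpool >/dev/null 2>&1; then
    # Output columns are NAME PROPERTY VALUE SOURCE; take VALUE from row 2.
    mode=$(zpool get failmode "$POOL" 2>/dev/null | awk 'NR==2 {print $3}')
    if [ -n "$mode" ]; then
        describe_failmode "$mode"
    fi
fi

# To change the setting (needs privileges), something like:
#   zpool set failmode=continue tank
```

Run it with the pool name as the first argument; with no argument it falls
back to the placeholder name.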

_______________________________________________
OmniOS-discuss mailing list
[email protected]
http://lists.omniti.com/mailman/listinfo/omnios-discuss

