On Monday, March 18, 2013 08:59:57 AM Nicholas Leippe wrote:
> On Mon, Mar 18, 2013 at 12:05 AM, Dan Egli <[email protected]> wrote:
> > All this discussion about RAID levels and whatnot has brought to mind a
> > different, if related, question. One of the reasons I like software RAID
> > is that it's easy to monitor. For example, I could have a cron script
> > that runs every 15 minutes and checks /proc/mdstat to make sure any
> > RAIDs listed report a healthy status. But how do you do something like
> > that for a hardware RAID? How can you tell, for example, if drive #3 in
> > a hardware RAID 10 has failed? This is something I honestly don't know
> > off the top of my head. I know many of you have had experience with
> > hardware RAID and device failures in those arrays. How do you know?
> > There's no file you can check like mdstat, is there? I'd think this
> > would be especially important for remote hosted/co-located servers.
>
> IME it's vendor specific. Some of the cards I've used had their own
> monitoring software. Others had a utility you could use to query the
> controller and thus write your own monitoring plugin. Some had nothing:
> they would just beep, and then you'd have to use their access tool (a
> front end to their BIOS software) and navigate its menus to figure out
> what happened and deal with it. That could *possibly* be automated via
> an expect script, but not easily, since it's an ncurses-type interface.
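For the software-RAID side of Dan's question, the cron check against /proc/mdstat really is just a few lines. Here's a rough Python sketch of the idea (untested; the "(F)" failed-device marker and the "[UU]"-style status string are how I read mdstat, so treat the parsing as an assumption):

#!/usr/bin/env python3
# Rough sketch of a cron-driven /proc/mdstat check. Flags an array if any
# member is marked failed ("(F)") or the status string shows a missing
# device (an "_" in the "[UU]"-style bitmap).
import re
import sys

def degraded_arrays(path="/proc/mdstat"):
    bad = []
    with open(path) as f:
        text = f.read()
    # Each array stanza starts with something like "md0 : active raid10 ..."
    for stanza in re.split(r"\n(?=md\d+ :)", text):
        if not stanza.startswith("md"):
            continue
        name = stanza.split()[0]
        if "(F)" in stanza or re.search(r"\[U*_+U*\]", stanza):
            bad.append(name)
    return bad

if __name__ == "__main__":
    bad = degraded_arrays()
    if bad:
        # Cron mails any output to the admin by default, so printing is enough.
        print("Degraded or failed arrays: " + ", ".join(bad))
        sys.exit(1)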
For example, Dell has monitoring tools (Open Manage) for its servers, and there is a Nagios plugin you can use to monitor the health of the RAID and other hardware. When I set it up where I work, we quickly found two servers with a degraded RAID that needed fixing. I was also able to find a couple of other hardware problems that Dell fixed for me.
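On the Dell boxes you don't have to go through Nagios if you just want something quick in cron: Open Manage ships a command-line tool, omreport, and "omreport storage vdisk" lists each virtual disk with its state. Here's a sketch along the same lines as the mdstat one (the exact wording of the State field varies between Open Manage versions, so the string matching below is an assumption):

#!/usr/bin/env python3
# Sketch of a cron check that shells out to Dell Open Manage's omreport.
# Assumes Server Administrator (srvadmin) is installed and omreport is on
# the PATH; the "Degraded"/"Failed" strings it looks for are an assumption.
import subprocess
import sys

def vdisk_problems():
    # "omreport storage vdisk" prints one block per virtual disk,
    # including a State line such as "State : Ready" or "State : Degraded".
    out = subprocess.run(
        ["omreport", "storage", "vdisk"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines()
            if "Degraded" in line or "Failed" in line]

if __name__ == "__main__":
    problems = vdisk_problems()
    if problems:
        print("omreport is reporting RAID problems:")
        print("\n".join(problems))
        sys.exit(1)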
