On Wed, Mar 30, 2011 at 11:02 PM, Daniel Pittman <[email protected]> wrote:
> Hey.
>
> So, my server has a fairly boring LSI JBOD SAS controller running the
> set of SATA disks, and it has taken recently (and newly) to throwing a
> whole pile of errors at me:
>
> mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ
> Fail All Commands After Error}, SubCode(0x0000)
> mptbase: ioc0: LogInfo(0x31181000): Originator={PL}, Code={IO
> Cancelled Due to Recieve Error}, SubCode(0x1000)

Which LSI SAS controller are you using?
Have you updated to the latest firmware?

>
> I tend to get clusters of like commands, and timing varies a bunch; at
> least some people report SMART commands triggering these errors, but I
> can't track any down or anything.
Yes, you are right.  These messages come from the protocol layer
(that is what Originator={PL} in the log lines refers to).
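
If it helps, the LogInfo word can be split into its fields by hand.  This
is only a rough sketch, assuming the SAS field layout used by the kernel's
mptbase driver (top 4 bits bus type, next 4 bits originator, next 8 bits
code, low 16 bits subcode).  The function name and the originator table
are my own illustration, not part of any LSI tool.

```python
# Assumed layout of a 32-bit SAS LogInfo word (as in the mptbase driver):
#   bits 31-28  bus type   (0x3 = SAS)
#   bits 27-24  originator (0 = IOP, 1 = PL, 2 = IR)
#   bits 23-16  code
#   bits 15-0   subcode
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}  # illustrative lookup table


def decode_loginfo(value):
    """Split a LogInfo word into its four fields."""
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": (value >> 24) & 0xF,
        "code": (value >> 16) & 0xFF,
        "subcode": value & 0xFFFF,
    }


if __name__ == "__main__":
    # The two values from the log lines quoted above.
    for word in (0x31080000, 0x31181000):
        f = decode_loginfo(word)
        name = ORIGINATORS.get(f["originator"], "unknown")
        print("0x%08x: bus_type=0x%x originator=%s code=0x%02x subcode=0x%04x"
              % (word, f["bus_type"], name, f["code"], f["subcode"]))
```

Both of your messages decode to originator PL.  They differ only in the
code (0x08 and 0x18) and the subcode, which matches the text the driver
already prints.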

>
> Annoyingly, they don't seem to indicate a specific unit is
> responsible, there is no useful documentation to decoding the meaning
> of the message, and no utilities to indicate what on earth the root
> cause is.  Worse, no other visible errors from the system, so
> presumably whatever it is does not propagate up to the kernel enough
> to trigger any higher level problems...
>
> So, can anyone advise on how to track down the root cause of these
> problems?  This is a production system, so I don't especially want to
> take it offline or anything, and I can't see any specific externally
> visible problems this is causing...
>

The commands that hit these errors are being retried, either in the
card's firmware or in the Linux driver.

Also, if this is a 3Gb/s SAS system, there are many fixes in the drivers
available from the LSI web site.
Most of the fixes relate to handling SATA devices on the SAS bus.
_______________________________________________
PLUG mailing list
[email protected]
http://lists.pdxlinux.org/mailman/listinfo/plug
