Re: Multiple errors on server -- Where do I start looking?

Chuck Swiger Mon, 06 Feb 2012 10:51:29 -0800

On Feb 6, 2012, at 8:15 AM, Ryan Merrell wrote:
> We have an Intel modular blade server. The chassis has 2x 3-disk RAID(5) 
> arrays. Volume 1 is what the OS (FreeBSD 7.2) is installed on and Volume 2 is 
> mounted at /usr. These two volumes are da0 and da1.


This doesn't matter directly to your issue, but a 3-disk RAID-5 setup is not a 
great choice.  With six disks available, you'd almost certainly do better 
with either a single 6-disk-wide RAID-5 or a RAID-10.

> I got email notifications saying the web host I run in a jail hosted on this 
> server was down. I try to SSH into it, but it fails. I ping it and I get a 
> 50% return rate. So I log in to the management blade and start a virtual KVM 
> session to get into the blade. Once I'm into the basehost blade, I cat 
> dmesg.today and get a slew of errors. Here we go..
> (da3:mpt0:0:6:1): Logical unit not accessible, target port in standby state
> (da3:mpt0:0:6:1): Retrying Command (per Sense Data)
> (da3:mpt0:0:6:1): READ(10). CDB: 28 0 0 0 0 0 0 0 1 0
> (da3:mpt0:0:6:1): CAM Status: SCSI Status Error
> (da3:mpt0:0:6:1): SCSI Status: Check Condition
> (da3:mpt0:0:6:1): ILLEGAL REQUEST asc:4,b
> (da3:mpt0:0:6:1): Logical unit not accessible, target port in standby state
> (da3:mpt0:0:6:1): Retrying Command (per Sense Data)
> (da3:mpt0:0:6:1): READ(10). CDB: 28 0 0 0 0 0 0 0 1 0
> (da3:mpt0:0:6:1): CAM Status: SCSI Status Error
> (da3:mpt0:0:6:1): SCSI Status: Check Condition
> (da3:mpt0:0:6:1): ILLEGAL REQUEST asc:4,b
> (da3:mpt0:0:6:1): Logical unit not accessible, target port in standby state
> (da3:mpt0:0:6:1): Retries Exhausted
> 
> As mentioned before, our two volumes are da0 and da1. /dev lists da2 and da3 
> as well, but I have no idea what they are.  How do I figure out what da3 is 
> and what do the above error messages say about it? Someone on the forum asked 
> me if the two volumes are on the same controller and the answer is yes, they 
> are.

Check a dmesg after a reboot, or take a look at "camcontrol devlist" or 
"atacontrol list" and that ought to provide more information.  Since you're 
also using GEOM labels, "glabel status" is likely to be informative as well.
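
As for what the da3 messages themselves say, the CDB bytes can be decoded by 
hand.  The following is only a minimal sketch, assuming the standard READ(10) 
layout (opcode 0x28, a 4-byte big-endian LBA in bytes 2-5, a 2-byte transfer 
length in bytes 7-8) and that the kernel prints the bytes in hex; the function 
name decode_read10 is made up for illustration.

```python
def decode_read10(cdb_text):
    """Decode a READ(10) CDB printed as space-separated hex bytes."""
    cdb = [int(b, 16) for b in cdb_text.split()]
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    return {
        # Bytes 2-5: logical block address, big-endian.
        "lba": int.from_bytes(bytes(cdb[2:6]), "big"),
        # Bytes 7-8: number of blocks to transfer, big-endian.
        "blocks": int.from_bytes(bytes(cdb[7:9]), "big"),
    }


# The CDB from the dmesg output quoted above.
print(decode_read10("28 0 0 0 0 0 0 0 1 0"))  # {'lba': 0, 'blocks': 1}
```

So the failing command is a one-block read of LBA 0, which looks more like 
something probing the device than a filesystem doing real I/O, although that 
reading is a guess from the log alone.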

> GEOM_LABEL: Label for provider da0s1a is ufsid/4aeb03874c64d9f1.
> GEOM_LABEL: Label for provider da0s1d is ufsid/4aeb038ae8ae24cf.
> GEOM_LABEL: Label for provider da0s1e is ufsid/4aeb0387d999941a.
> GEOM_LABEL: Label for provider da0s1f is ufsid/4aeb038766c4c807.
> Trying to mount root from ufs:/dev/da0s1a
> GEOM_LABEL: Label ufsid/4aeb03874c64d9f1 removed.
> GEOM_LABEL: Label for provider da0s1a is ufsid/4aeb03874c64d9f1.
> GEOM_LABEL: Label ufsid/4aeb0387d999941a removed.
> GEOM_LABEL: Label ufsid/4bd2077f23a6cc93 removed.
> GEOM_LABEL: Label for provider da0s1e is ufsid/4aeb0387d999941a.
> GEOM_LABEL: Label for provider da1s1 is ufsid/4bd2077f23a6cc93.
> GEOM_LABEL: Label ufsid/4aeb038766c4c807 removed.
> GEOM_LABEL: Label for provider da0s1f is ufsid/4aeb038766c4c807.
> GEOM_LABEL: Label ufsid/4aeb038ae8ae24cf removed.
> GEOM_LABEL: Label for provider da0s1d is ufsid/4aeb038ae8ae24cf.
> GEOM_LABEL: Label ufsid/4aeb03874c64d9f1 removed.
> GEOM_LABEL: Label ufsid/4aeb0387d999941a removed.
> GEOM_LABEL: Label ufsid/4aeb038766c4c807 removed.
> GEOM_LABEL: Label ufsid/4aeb038ae8ae24cf removed.
> GEOM_LABEL: Label ufsid/4bd2077f23a6cc93 removed.
> 
> Was root unmounted? What's going on here? Obviously there's some issue with 
> da0, which is mounted at /. The server has been up and running fine, so why 
> am I seeing "Trying to mount root from ufs:/dev/da0s1a"?

These are standard messages from GEOM: it's reading the disk labels to 
figure out where to mount the various filesystems.

> pid 93248 (httpd), uid 80: exited on signal 10
> pid 95624 (httpd), uid 80: exited on signal 10
> pid 97956 (httpd), uid 80: exited on signal 10
> pid 97935 (httpd), uid 80: exited on signal 10
> pid 96603 (httpd), uid 80: exited on signal 10
> pid 93210 (httpd), uid 80: exited on signal 10
> pid 98246 (httpd), uid 80: exited on signal 10
> 
> This is apparently what's killing our webserver. Apache receives a signal 10 
> and quits.. Everything I've read says it's an issue with Apache trying to 
> access RAM that it shouldn't or that doesn't exist.. Is there something else 
> with the above da0 or da3 errors that would cause a SIGBUS on httpd?

That's unclear, but normally a failing disk will cause I/O to block and the 
httpds will simply hang, not crash.

Most likely, you've got a bug lurking in one of the Apache modules you use 
(mod_php is a likely candidate), but run a test instance of httpd under gdb 
using the -X flag, and see whether you can get more information.  Or unlimit 
coredumpsize, and run gdb against the corefile to see what's causing the crash.
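
Before setting up gdb, it may be worth tallying the crash lines to confirm 
that every httpd exit is the same signal.  This is a rough sketch only: the 
helper tally_crashes and the small signal table are made up for illustration, 
and the table hard-codes FreeBSD's numbering (10 is SIGBUS there, unlike on 
Linux) rather than asking the host OS.

```python
import re
from collections import Counter

# FreeBSD signal numbers for the usual crash signals (assumed table, not
# taken from the running system).
FREEBSD_SIGNALS = {6: "SIGABRT", 10: "SIGBUS", 11: "SIGSEGV"}

# Matches kernel lines such as "pid 93248 (httpd), uid 80: exited on signal 10".
CRASH_LINE = re.compile(r"pid (\d+) \(([^)]+)\), uid (\d+): exited on signal (\d+)")


def tally_crashes(lines):
    """Count crash lines per (program, signal name)."""
    counts = Counter()
    for line in lines:
        match = CRASH_LINE.search(line)
        if match:
            signo = int(match.group(4))
            name = FREEBSD_SIGNALS.get(signo, f"signal {signo}")
            counts[(match.group(2), name)] += 1
    return counts


sample = [
    "pid 93248 (httpd), uid 80: exited on signal 10",
    "pid 95624 (httpd), uid 80: exited on signal 10",
]
print(tally_crashes(sample))
```

If the counts show a mix of signals or programs, that would point away from a 
single buggy module.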

Regards,
-- 
-Chuck
