Hi,
I don't know how long after you bought/built your server you added the extra memory. I ask because the DRAMs on the newer modules might be slightly different. When we build servers, we stick to a particular brand for several reasons, one of them being that the DRAM specs do not change on a given SKU.

Here is what I recommend you try.

Take out all the memory, then:

1. Install one DIMM only and see if the problem persists.
2. If the problem stops, add a second DIMM. Still good? Add a third, and keep adding DIMMs one at a time.
3. When the problem comes back, remove all the DIMMs again and put the last DIMM you added (the one that brought the problem back) alone in the first slot. If the problem persists, you found the winner.
4. If not, reinstall all the DIMMs except the one you just tested, using a different DIMM in that slot. If the problem persists, you found a bad memory slot.

It's a real PITA(tm), but that is the only way to find the issue if it is indeed a bad memory module or a bad memory slot.

One more thing you should try: did you enable IPMI? If so, run

  # ipmitool -H x.x.x.x sel list

Take a look at the output. If you did not enable IPMI (IP address/netmask/gateway), the BIOS should have a place to do so. Sorry, we don't sell/build Supermicro, so I am unfamiliar with those boards.
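Since the SEL can be long, here is a sketch of how you might narrow it down to memory-related events. The BMC address and credentials in the comment are placeholders, and the `filter_ecc` helper is just an assumed name for a plain grep; `ipmitool sel list` itself is the real command. The sample lines shown are made-up output for illustration only.

```shell
# Full remote query would look something like (substitute your BMC's
# IP, user, and password):
#   ipmitool -I lanplus -H 192.168.1.50 -U ADMIN -P ADMIN sel list

# Memory trouble usually shows up as ECC/DIMM events, so a simple
# case-insensitive filter narrows the log down:
filter_ecc() {
    grep -iE 'ECC|memory|DIMM'
}

# Demonstration against canned SEL-style lines (made-up sample data):
printf '%s\n' \
  '1 | 10/15/2012 | 03:12:44 | Memory #0x53 | Correctable ECC | Asserted' \
  '2 | 10/15/2012 | 03:12:45 | Fan #0x41 | Lower Critical going low' \
  | filter_ecc
```

If correctable ECC events line up with the hang times, that points straight at a DIMM or slot and saves you some of the swap-and-test rounds above.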

If you are using both the Kingston and the Crucial memory, use just one of them; do not mix brands.

Hope this can help you out.

Lanny
Servaris Corporation
http://www.servaris.com


On 10/16/2012 3:48 PM, nate keegan wrote:
I'm only seeing gstat output of a few percentage points for the OS disks.

I am using ECC memory (both the Kingston and the new Crucial memory)
and went ahead and swapped out the SSD for SATA disks this morning.

Since both SSD were the same firmware and type/manufacturer I figured
it was a good time to address this variable.

I also went ahead and put in a serial console server this morning so I
have proper console access instead of relying on the Supermicro iLO
utility.

Will keep an eye on the pure SATA setup to see if it barfs or not.
Will try to gather some ddb(4) information if it does barf again.


On Mon, Oct 15, 2012 at 1:32 PM, Dieter BSD <[email protected]> wrote:
SSD are connected to on-board SATA port on motherboard

Presumably to controllers provided by the Intel Tylersburg 5520 chipset.

This system was commissioned in February of 2012 and ran without issue
as a ZFS backup system on our network until about 3 weeks ago.

The system is dual PSU behind a UPS so I don't think that this is an issue.

No changes? e.g. no added hardware to increase power load.
Overloading the power supply and/or the wiring (with too many splitters)
can result in flaky problems like this.

OS will respond to ping requests after the issue and if you have an
active SSH session you will remain connected to the system until you
attempt to do something like 'ls', 'ps', etc.

I am not able to drop into DDB when the issue happens as the system is
locked up completely. Could be a failure on my part to
understand/engage in how to do this, will try if the issue happens
again (should on Wednesday AM unless setting camcontrol apm to off for
the disks somehow fixes the issue).

If the system is alive enough to respond to ping, I'd expect you
should be able to get into DDB? Can you get into DDB when the system
is working normally?
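For getting into DDB from the new serial console, something like the fragment below is worth having in place before the next hang. These two sysctls are real FreeBSD knobs; whether they help depends on how wedged the machine actually is.

```
# /etc/sysctl.conf -- allow dropping into DDB from the console.
# With these set, a serial BREAK (or the <CR> ~ Ctrl-B sequence on
# consoles that cannot send BREAK) enters the kernel debugger.
debug.kdb.break_to_debugger=1
debug.kdb.alt_break_to_debugger=1
```

If the box still answers ping but DDB won't engage, the hang is likely below the point where the console interrupt is serviced, which is itself useful data.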

2 x Crucial M4 64 GB SATA SSD for FreeBSD OS (zroot)
2 x Intel 320 MLC 80 GB SATA SSD for L2ARC and swap

I ran the Crucial firmware update ISO and it did not see any firmware
updates as necessary on the SSD disks.

Does the problem happen with both the Crucial and the Intel SSDs?

If software, I agree that it would not make sense for this to
suddenly pop up after months of operation with no issues.

If something causes the software/firmware to take a different
path, new issues can appear. E.g. error handling or even timing.
Infrequently used code paths might not have been tested sufficiently.

Does the controller have firmware? Part of the BIOS I suppose.
Is there a BIOS update available? Have you considered connecting the
SSDs to a different controller?

the on-board AHCI portion of the BIOS does
not always see the disks after the event without a hard system power
reset.

That's at least one bug somewhere, probably the hardware isn't getting reset
properly. Does Supermicro know about this bug?

I have 48 GB of Crucial memory that I will put in this system today to
replace the 24 GB or so of Kingston memory I have in the system.

Which in addition to being different memory, should reduce swap activity.

Suggestion: move everything to conventional drives. Keep at least one
SSD connected to system, but normally unused. Now you can beat on the
SSD in a controlled manner to debug the problem. Does reading trigger
the problem? Writing? Try dd with different blocksizes, accessing
multiple SSDs at once, etc. I have to wonder if there is a timing problem,
or missing interrupt, or...
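The controlled beat-up described above could be sketched roughly as follows. The device name `/dev/ada4` is only an assumed example for the spare SSD; the script defaults to a scratch file so it can be dry-run harmlessly, and `beat_on` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Read/write passes at several block sizes against one target.
# Returns nonzero as soon as dd reports an I/O error.
beat_on() {
    target=$1
    for bs in 512 4096 65536 1048576; do
        count=$((4194304 / bs))   # ~4 MB per pass, scaled to the block size
        dd if=/dev/zero of="$target" bs="$bs" count="$count" 2>/dev/null || return 1
        dd if="$target" of=/dev/null bs="$bs" 2>/dev/null || return 1
    done
}

# Default to a scratch file; pass the spare SSD (e.g. /dev/ada4,
# an assumed name) once you mean it.
TARGET=${1:-/tmp/ssd-scratch.img}
beat_on "$TARGET" && echo "no I/O errors on $TARGET"
```

Running two copies at once against two SSDs would approximate the multiple-device case, which is where a timing or missing-interrupt problem is most likely to show itself.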

* Ditch FreeBSD for Solaris so I can keep ZFS lovin for the intended
purpose of this system

If it fails with FreeBSD but works with Solaris on the same hardware,
then it is almost certainly a problem with the device driver. (Or
at least a problem that Solaris has a workaround for.)
_______________________________________________
[email protected] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hardware
To unsubscribe, send any mail to "[email protected]"