On Thu, Jan 30, 2014 at 09:34:28AM -0500, Mike Julian wrote:
> We constantly find memory, network, and hard drive failures very early on
> after turning over to a customer. Sometimes as soon as we start to deploy
> an OS/software to them. It happens *quite* often.

TL;DR: As you mentioned, changing suppliers is a great option.

I spent years in reliability engineering. My hunch is ESD
damage, which often causes latent defects in addition to
outright dead parts. Test your own ESD equipment. I recently
found a "grounded floor" where the ground lead went no further
than a coil of wire in the ceiling.

The most effective fix is to insist that each part of the supply
chain take responsibility. Who tested it? What did they test?
Where are the test logs? Was it unsealed when they handled it?
Was it handled in an ESD-protected shop? How was it stored? Make
them refer you to the previous stop in the supply chain, and
inspect assembly facilities on-site. You're not done until the
manufacturer of the failed part sends a lab report explaining
what failed and why. I've uncovered everything from night-shift
employees who moved parts to the tested bin without any test at
all (fired) to circuit design issues (setup time violations in a
disk controller), ESD handling issues, 5-foot drops during
shipping, lightning hits in service, you name it.

A less effective (but more likely to be practical) method is to
attempt to "test quality in" once you receive the product.  
-> You can't test quality into a product <-, but you can weed
out some detectable failures. In no particular order, you might
try:

- Run continuous full kernel compiles (easy to set up, very
  likely to catch CPU/RAM issues). 24 hours should catch anything
  you're likely to catch using this method.

- Run the drive conveyance test, once, on each hard drive (man
  smartctl).

- Run hard drive "long" tests on each hard drive, for about 24
  hours (man smartctl).

- Run random R/W tests (lots of seeks) on the hard drives.

- Get a variac, run kernel compiles at lowest and highest spec'd
  line voltages.

- Find a way to run kernel compiles at the high-end of the
  spec'd ambient temperature.
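
A rough Python sketch of how the compile loop, the smartctl
self-tests, and a seek-heavy pass from the list above might be
scripted. Treat it as a starting point under stated assumptions,
not a finished burn-in tool. The function names (burn_in,
kernel_compile_pass, smartctl_selftest_cmd, random_read_pass) are
made up for illustration; the only real commands are make and
smartctl -t conveyance / -t long. The seek pass is read-only,
a non-destructive stand-in for a full random R/W test, and
repeated reads may be served from the page cache rather than the
platters.

```python
import os
import random
import subprocess
import time


def smartctl_selftest_cmd(device, kind):
    """Build the argv that starts a SMART self-test (see man smartctl)."""
    if kind not in ("conveyance", "long"):
        raise ValueError(f"unsupported self-test type: {kind}")
    return ["smartctl", "-t", kind, device]


def kernel_compile_pass(jobs=None):
    """Commands for one clean rebuild; assumes a configured kernel tree."""
    jobs = jobs or os.cpu_count() or 1
    return [["make", "clean"], ["make", f"-j{jobs}"]]


def burn_in(commands, hours, cwd=None, run=subprocess.run,
            clock=time.monotonic):
    """Repeat the command sequence until time runs out or a command fails.

    Returns (completed_passes, ok). A pass that is already running when
    the deadline arrives is allowed to finish.
    """
    deadline = clock() + hours * 3600
    passes = 0
    while clock() < deadline:
        for argv in commands:
            if run(argv, cwd=cwd).returncode != 0:
                return passes, False
        passes += 1
    return passes, True


def random_read_pass(path, reads, block_size=4096, rng=random):
    """Read `reads` blocks at random aligned offsets; return bytes read.

    Read-only by design. `path` may be a regular file or, with enough
    privileges, a block device such as a hypothetical /dev/sdX.
    """
    total = 0
    with open(path, "rb", buffering=0) as f:
        size = f.seek(0, os.SEEK_END)  # also works for block devices
        blocks = size // block_size
        if blocks == 0:
            raise ValueError("target is smaller than one block")
        for _ in range(reads):
            f.seek(rng.randrange(blocks) * block_size)
            total += len(f.read(block_size))
    return total
```

For the 24-hour compile run, something like
burn_in(kernel_compile_pass(), hours=24, cwd="/usr/src/linux")
would do, where the source path is only an example. A build that
fails in a tree known to compile cleanly is the signal to look at
CPU and RAM. smartctl -t only starts a self-test and returns
immediately; read the outcome later with smartctl -l selftest.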

Good luck,
-- 
Charles

_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/