On 05/04/2011 08:31 PM, David Boreham wrote:
Here's my best theory at present: the failures ARE caused by cell wear-out, but the SSD firmware is buggy insofar as it fails to boot up and respond to host commands due to the wear-out state. So rather than the expected outcome (SSD responds but has read-only behavior), it appears to be (and is) dead. At least to my mind, this is a more plausible explanation for the reported failures than the alternative (SSD vendors are uniquely clueless at making basic electronics subassemblies), especially considering the difficulty of testing the firmware under all possible wear-out conditions.

One question worth asking is: in the cases you were involved in, was manufacturer failure analysis performed (and if so, what was the failure cause reported)?

Unfortunately not. Many of the people I deal with, particularly the ones with budgets to be early SSD adopters, are not the sort to return things that have failed to the vendor. In some of these shops, if the data can't be securely erased first, it doesn't leave the place. The idea that some trivial fix at the hardware level might bring the drive back to life, data intact, is terrifying to many businesses when drives fail hard.

Your bigger point, that this could just as easily be software failures due to unexpected corner cases rather than hardware issues, is both a fair one to raise and even more scary.

Intel claims the Annual Failure Rate (AFR) of their SSDs in IT deployments (not OEM ones) is 0.6%. Typical measured AFRs for mechanical drives are around 2% during their first year, spiking to 5% afterwards. I suspect that Intel's numbers are actually much better than the other manufacturers' here, so an SSD from anyone else can easily still be less reliable than a regular hard drive.

Hmm, this is speculation I don't support (non-Intel vendors have a 10x worse early failure rate). The entire industry uses very similar processes (often the same factories). One rogue vendor with a bad process... sure, but all of them?


I was postulating that a vendor only has to be 4X as bad as Intel to reach 2.4%, which is already worse than the roughly 2% first-year rate of a mechanical drive. If you look at http://labs.google.com/papers/disk_failures.pdf you can see there's a 5:1 ratio in first-year AFR just between light and heavy usage on the drive. So a 4:1 ratio between best and worst manufacturer for SSDs seemed possible. Plenty of us have seen particular drive models that were much more than 4X as bad as average ones among regular hard drives.
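
The arithmetic behind that 4X figure can be checked with a short sketch. It uses only the two rates quoted in this thread (Intel's claimed 0.6% and the roughly 2% first-year figure for mechanical drives). The function names are made up for illustration, and the multiplier is a hypothetical "how much worse than Intel" factor, not a measured value for any vendor.

```python
# AFR figures quoted in this thread, as fractions per year.
INTEL_SSD_AFR = 0.006       # Intel's claimed 0.6% for IT deployments
HDD_FIRST_YEAR_AFR = 0.02   # roughly 2% for mechanical drives in year one


def scaled_afr(base_afr, multiplier):
    """AFR of a hypothetical vendor `multiplier` times worse than the base."""
    return base_afr * multiplier


def break_even_multiplier(base_afr, target_afr):
    """How many times worse than the base a vendor must be to match the target."""
    return target_afr / base_afr


if __name__ == "__main__":
    four_x = scaled_afr(INTEL_SSD_AFR, 4)
    print(f"4X Intel's AFR: {four_x:.1%}")                        # 2.4%
    print(f"Worse than HDD year one: {four_x > HDD_FIRST_YEAR_AFR}")
    ratio = break_even_multiplier(INTEL_SSD_AFR, HDD_FIRST_YEAR_AFR)
    print(f"Break-even multiplier: {ratio:.2f}")                  # 3.33
```

So under these numbers anything more than about 3.3X Intel's rate is already behind a first-year mechanical drive, and 4X is just the next round figure past that point.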

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)