On Sun, Apr 11, 2010 at 7:07 AM, Robert E. Seastrom <r...@seastrom.com> wrote:
> We've seen great increases in CPU and memory speeds as well as disk
> densities since the last maximum (March 2000). Speccing ECC memory is
> a reasonable start, but this sort of thing has been a problem in the
> past (anyone remember the Sun UltraSPARC CPUs that had problems last
> time around?) and will no doubt bite us again.

Sun's problem had an easy solution, and it's exactly the one you've
mentioned: ECC. The issue with the UltraSPARC IIs was that their caches
had enough redundancy to detect a problem (parity), but not enough to
correct it (ECC; there's a toy illustration of the difference in the
P.S. below). They also (initially) had very abrupt handling of such
errors: they would basically panic and restart.

From the UltraSPARC III onwards they fixed this by sticking with parity
in the L1 cache (which is write-through, so on a parity error you can
simply discard the cache line and re-read it from memory or a
higher-level cache), while using ECC on the L2 and higher (write-back)
caches. The memory and all the datapaths were already protected by ECC
in everything but the low-end systems.

It does raise a very interesting question, though: how many systems are
you running that don't use ECC _everywhere_ (CPU, memory, and datapath)?
Unlike many years ago, parity memory is now basically non-existent,
which means that if you're not using ECC you're probably suffering
relatively regular single-bit errors without knowing it. In network
devices that's less of an issue, since you can normally rely on
higher-level protocols to detect/correct the errors, but if you're not
using ECC in your servers you're asking for (silent) trouble...

Scott.
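
P.S. For anyone who wants to see the detect-versus-correct distinction
concretely, here's a toy Python sketch: a single even-parity bit versus
a Hamming(7,4) code. Real cache and memory ECC uses wider SECDED codes
(roughly 8 check bits over a 64-bit word), but the principle is the
same; the function names here are just illustrative.

def parity(bits):
    """A single even-parity bit: detects an odd number of flipped bits,
    but gives no clue which bit flipped, so the word can't be repaired."""
    p = 0
    for b in bits:
        p ^= b
    return p

def hamming74_encode(d):
    """Encode 4 data bits as a Hamming(7,4) codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4           # check bit covering positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4           # check bit covering positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4           # check bit covering positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(cw):
    """Recompute the check bits; the syndrome is the 1-based position of
    a single flipped bit (0 = no error), so we can flip it back."""
    s1 = cw[0] ^ cw[2] ^ cw[4] ^ cw[6]
    s2 = cw[1] ^ cw[2] ^ cw[5] ^ cw[6]
    s3 = cw[3] ^ cw[4] ^ cw[5] ^ cw[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        cw = cw[:]
        cw[syndrome - 1] ^= 1   # repair the corrupted bit in place
    return cw, syndrome

data = [1, 0, 1, 1]
good = hamming74_encode(data)
hit = good[:]
hit[5] ^= 1                     # a stray particle flips one bit

# Parity: we can tell the word is bad, but not which bit to fix.
print("parity says corrupted:", parity(hit) != parity(good))

# ECC: the syndrome locates the bad bit and the word is repaired
# transparently, which is exactly what the hardware does for you.
fixed, pos = hamming74_correct(hit)
print("ECC corrected bit", pos, "- repaired ok:", fixed == good)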