* Borislav Petkov <b...@alien8.de> wrote:

> On Fri, Jul 08, 2016 at 11:46:53AM +0200, Ingo Molnar wrote:
> > I'm not sure I can parse that: how can a reported error have bits corrupted?
>
> No, it is about the actual bits in memory the ECC error is generated
> for. So, for example, if an ECC error reports that memory location X had
> some bit flips, the syndrome value which gets reported together with
> same ECC error shows which actual bits have flipped.
>
> Here's an example from the AMD BKDG, maybe that'll make it more clear:
>
> http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf
>
> Go to page 246, there it says this:
>
> "For example, assume the ECC syndrome is 03EAh. First search row EAh
> for the complete syndrome. Since it is not found, search row 03h for
> the complete syndrome. It is found in column 9h, so symbol 9h has the
> error. Since the error bitmask indicates value 3h (0011b), bits 0 and 1
> within that symbol are corrupted. Symbol 9h maps to bits 72-79, so the
> corrupted bits are 72 and 73 of the line."
>
> So you basically search the table of x8 ECC correctable syndromes, first
> in row EAh (second syndrome byte) and if you don't find the complete
> syndrome there, you search row 03 for it.
>
> It is in column 9 and that means symbol 9. The symbols are 16 - one
> symbol for each byte in a 128bit DRAM word + 3 special symbols for the
> ECC bits.
>
> The row number 3h is also the error bitmask, so bits 0 and 1 are the
> ones which are corrupted.
>
> Which means, when you look at the value in DRAM at the address the error
> was reported, you need to go to symbol 9, that's 9*8 = 72 which means,
> bits 72-79 and the first 2 in that byte are bits 72 and 73.
>
> So if you want to correct them, you simply flip them as the syndrome
> tells you that those 2 are corrupted.
>
> Ok?
So is 'ECC syndrome' a fancy word and a complicated process for identifying what
data got corrupted, in a more accurate fashion than what we had before? Because
previously we already had a memory address of the memory corruption, right?

What is the typical 'scope' of that memory corruption address - a cache line, a
machine word, a byte or maybe a variable unit that is memory hardware dependent?

Thanks,

	Ingo