After installing Wheezy (using FAI, so the setup is essentially unaltered), one of my machines doesn't report memory errors via mcelog anymore. Error messages go to syslog instead:
> Jun 3 09:47:07 testbed kernel: [231899.816038] [Hardware Error]: CPU:0 MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd467400000000833
> Jun 3 09:47:07 testbed kernel: [231899.816282] [Hardware Error]: MC0_ADDR: 0x0000000076d39ec0
> Jun 3 09:47:07 testbed kernel: [231899.816377] [Hardware Error]: Data Cache Error: during system linefill.
> Jun 3 09:47:07 testbed kernel: [231899.816534] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: DRD, part-proc: SRC (no timeout)
> Jun 3 09:47:07 testbed kernel: [231899.816899] [Hardware Error]: CPU:0 MC2_STATUS[Over|CE|-|-|-|CECC]: 0xd000400000000863
> Jun 3 09:47:07 testbed kernel: [231899.817136] [Hardware Error]: Bus Unit Error: PRF/ECC error in data read from NB: SRC.
> Jun 3 09:47:07 testbed kernel: [231899.817314] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: PRF, part-proc: SRC (no timeout)
> Jun 3 09:47:07 testbed kernel: [231899.817677] [Hardware Error]: CPU:0 MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd467400200000813
> Jun 3 09:47:07 testbed kernel: [231899.817915] [Hardware Error]: MC4_ADDR: 0x000000007fafc410
> Jun 3 09:47:07 testbed kernel: [231899.818009] [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
> Jun 3 09:47:07 testbed kernel: [231899.818189] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x7fafc410
> Jun 3 09:47:07 testbed kernel: [231899.818289] EDAC MC0: CE page 0x7fafc, offset 0x410, grain 0, syndrome 0xce, row 1, channel 0, label "": amd64_edac
> Jun 3 09:47:07 testbed kernel: [231899.818298] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
> Jun 3 09:47:08 testbed kernel: [231900.804029] [Hardware Error]: CPU:1 MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd467400000000833
> Jun 3 09:47:08 testbed kernel: [231900.804278] [Hardware Error]: MC0_ADDR: 0x000000007a673600
> Jun 3 09:47:08 testbed kernel: [231900.804371] [Hardware Error]: Data Cache Error: during system linefill.
> Jun 3 09:47:08 testbed kernel: [231900.804530] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: DRD, part-proc: SRC (no timeout)
> Jun 3 09:47:08 testbed kernel: [231900.804894] [Hardware Error]: CPU:1 MC2_STATUS[Over|CE|-|-|-|CECC]: 0xd000400000000863
> Jun 3 09:47:08 testbed kernel: [231900.805130] [Hardware Error]: Bus Unit Error: PRF/ECC error in data read from NB: SRC.
> Jun 3 09:47:08 testbed kernel: [231900.810632] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: PRF, part-proc: SRC (no timeout)
> Jun 3 09:52:07 testbed kernel: [232199.816039] [Hardware Error]: CPU:0 MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd467400000000833
> Jun 3 09:52:07 testbed kernel: [232199.816284] [Hardware Error]: MC0_ADDR: 0x00000021086ea0c0
> Jun 3 09:52:07 testbed kernel: [232199.816378] [Hardware Error]: Data Cache Error: during system linefill.
> Jun 3 09:52:07 testbed kernel: [232199.816536] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: DRD, part-proc: SRC (no timeout)
> Jun 3 09:52:07 testbed kernel: [232199.816901] [Hardware Error]: CPU:0 MC2_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd400400000000813
> Jun 3 09:52:07 testbed kernel: [232199.817139] [Hardware Error]: MC2_ADDR: 0x0000000077ef0cc0
> Jun 3 09:52:07 testbed kernel: [232199.817232] [Hardware Error]: Bus Unit Error: RD/ECC error in data read from NB: SRC.
> Jun 3 09:52:07 testbed kernel: [232199.817409] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
> Jun 3 09:52:07 testbed kernel: [232199.817771] [Hardware Error]: CPU:0 MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd467400200000813
> Jun 3 09:52:07 testbed kernel: [232199.818008] [Hardware Error]: MC4_ADDR: 0x000000007fafc410
> Jun 3 09:52:07 testbed kernel: [232199.818101] [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
> Jun 3 09:52:07 testbed kernel: [232199.818282] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x7fafc410
> Jun 3 09:52:07 testbed kernel: [232199.818382] EDAC MC0: CE page 0x7fafc, offset 0x410, grain 0, syndrome 0xce, row 1, channel 0, label "": amd64_edac
> Jun 3 09:52:07 testbed kernel: [232199.818391] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
> Jun 3 09:52:08 testbed kernel: [232200.804035] [Hardware Error]: CPU:1 MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd467400000000833
> Jun 3 09:52:08 testbed kernel: [232200.804283] [Hardware Error]: MC0_ADDR: 0x000000007a673600
> Jun 3 09:52:08 testbed kernel: [232200.804377] [Hardware Error]: Data Cache Error: during system linefill.
> Jun 3 09:52:08 testbed kernel: [232200.804534] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: DRD, part-proc: SRC (no timeout)
> Jun 3 09:52:08 testbed kernel: [232200.804899] [Hardware Error]: CPU:1 MC2_STATUS[Over|CE|-|-|-|CECC]: 0xd000400000000863
> Jun 3 09:52:08 testbed kernel: [232200.805136] [Hardware Error]: Bus Unit Error: PRF/ECC error in data read from NB: SRC.
> Jun 3 09:52:08 testbed kernel: [232200.805312] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: PRF, part-proc: SRC (no timeout)

The mcelog setup hasn't changed; in fact, /etc/mcelog/* is identical to a Squeeze setup that works. Its logfile just stays at zero size.

I find it a bit hard to spot the important information in the syslog records, in particular whether an ECC error has been corrected or not (and when to take action, i.e. power off the node).

Obviously I have missed an important change (perhaps related to the edac_* modules?). How can I get back to mcelog?

Thanks,
S
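
P.S. To make the "corrected or not" question concrete, here is a minimal Python sketch of the kind of summary I would like to get out of the syslog records quoted above. It is not a replacement for mcelog. It assumes that the second field of the bracketed MCn_STATUS flag list reads CE for a corrected and UE for an uncorrected error, which is only my reading of the records; the names STATUS_RE and summarize are made up for this illustration.

```python
import re
from collections import Counter

# Matches records such as
#   "... [Hardware Error]: CPU:0 MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd467400200000813"
# The separator between the CPU and bank fields may be a space or a tab.
STATUS_RE = re.compile(
    r"CPU:(?P<cpu>\d+)\s+MC(?P<bank>\d+)_STATUS"
    r"\[(?P<flags>[^\]]*)\]:\s*(?P<status>0x[0-9a-fA-F]+)"
)


def summarize(lines):
    """Count MCn_STATUS records per (cpu, bank, kind).

    kind is "corrected", "uncorrected" or "unknown", taken from the
    textual flag list (assumption: CE = corrected, UE = uncorrected).
    """
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue  # address, EDAC and description records are skipped
        flags = match.group("flags").split("|")
        if "UE" in flags:
            kind = "uncorrected"
        elif "CE" in flags:
            kind = "corrected"
        else:
            kind = "unknown"
        key = (int(match.group("cpu")), int(match.group("bank")), kind)
        counts[key] += 1
    return counts


# Two records copied from the log above, used as a small demonstration.
demo = [
    "Jun 3 09:47:07 testbed kernel: [231899.816038] [Hardware Error]: "
    "CPU:0 MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd467400000000833",
    "Jun 3 09:47:07 testbed kernel: [231899.816282] [Hardware Error]: "
    "MC0_ADDR: 0x0000000076d39ec0",
]
for (cpu, bank, kind), n in sorted(summarize(demo).items()):
    print(f"CPU {cpu} bank MC{bank}: {n} {kind}")
```

Feeding it the whole syslog (for example via open("/var/log/syslog"), if that is where the kernel messages end up on the node) would give one count per CPU, bank and kind, so that any "uncorrected" entry stands out as the case where the node should be taken down.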