Re: [CentOS] Cant find out MCE reason (CPU 35 BANK 8)

2011-04-20 Thread Vladimir Budnev
On 03/22/11 19:00, m.r...@5-cent.us wrote:
 Vladimir Budnev wrote:

 2011/3/22m.r...@5-cent.us
  
 CHOMP

 So with 2 4-core Xeons, I don't understand how you can get 3x and 5x.
 Could you post some raw messages, either from /var/log/message or
 from /var/log/mcelog?


 sure here they are before night party:
 MCE 24
 CPU 52 BANK 8 TSC 372a290717a
 MISC 68651f81186 ADDR 7dd2ad840
 STATUS cc000281009f MCGSTATUS 0
 MCE 25
  
 snip
 At this point, I throw up my hands. I have *no* idea how they could get
 numbers like CPU 52, unless something's wrong in the o/s - I mean, you
 are running 64 bit, right?

 Yeah, x86_64
 I have an idea, I don't know... the thing is, we are running CentOS 4.8.
 It is old enough, and the mcelog version is old enough too, that maybe it
 decodes something completely wrong.
  
 It could be that 4.8 doesn't really understand the CPU.


 Anyway thanks so much for your time and answers. Hope we will find those
 dimms in experiments.
  
 Seriously - how old is this? I think you should call your vendor: some
 will give you phone or email support, even after the end of warranty.

   mark



I forgot to post our solution; maybe it will be useful for someone. In our
case the problem was (as expected) in the DIMM modules. After replacing
them, there are no more scary mcelog entries.

___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Cant find out MCE reason (CPU 35 BANK 8)

2011-03-22 Thread Vladimir Budnev
2011/3/21 m.r...@5-cent.us

 Vladimir Budnev wrote:
  Hello community.
 
  We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with 2xIntel Xeon
  E5630 and 8xKingston KVR1333D3D4R9S/4G
 
  For some time we have lots of MCE in mcelog and we cant find out the
  reason.

 The only thing that shows there (when it shows, since sometimes it doesn't
 seem to) is a hardware error. You *WILL* be replacing hardware, sometime
 soon, like yesterday.

 Normal is not: *ANYTHING* here is Bad News. First, you've got DIMMs
 failing.  CPU 53, assuming this system doesn't have 53+ physical CPUs,
 means that you have x-core systems, so you need to divide by x, so that if
 it's a 12-core system with 6 physical chips, that would make it DIMM 8
 associated with that physical CPU.
 snip
  One more interesting thins is the following output:
  [root@zuno]# cat /var/log/mcelog |grep CPU|sort|awk '{print $2}'|uniq
  32
  33
  34
  35
  50
  51
  52
  53
 
  Those numbers are always the same.

 Bad news: you have *two* DIMMs failing, one associated with the physical
 CPU that has core 53, and another associated with the physical CPU that
 has cores 32-35.

 Talk to your OEM support to help identify which banks need replacing,
 and/or find a motherboard diagram.

  mark, who has to deal *again* with one machine with the same
 problem


Thanks for the answer!

Last night we did some investigation to find out which RAM modules are faulty.

Note that we have 8 modules of 4 GB each.

First we removed the modules in slots a3 and b1 for each CPU, and there was
no change in hardware behaviour. Errors appeared after boot.

Then we removed a1 and a2 (yes, I know that for high performance we should
populate slots starting from a1, but it was our mistake, and in any case the
server started) and... there were no errors for 1 hour. Usually we observe
errors roughly every 5 minutes.

Then we put 2 modules back. At that step we had slots a1, a3 and b1 occupied
for each CPU. No errors.

Finally we put the last 2 modules back... and no errors. Note that at that
step we had exactly the same module placement as before the experiment.

It sounds strange, but at first glance it looks like something was wrong with
the module placement. But we can't understand why the problem didn't show up
during the first days, or even the first month, of the server running. No one
touched the server hardware, so I have no idea what that was.

Now we are just waiting to see whether the errors come back.


Re: [CentOS] Cant find out MCE reason (CPU 35 BANK 8)

2011-03-22 Thread Nico Kadel-Garcia
On Tue, Mar 22, 2011 at 7:33 AM, Vladimir Budnev
vladimir.bud...@gmail.com wrote:


 2011/3/21 m.r...@5-cent.us

 Vladimir Budnev wrote:
  Hello community.
 
  We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with 2xIntel Xeon
  E5630 and 8xKingston KVR1333D3D4R9S/4G
 
  For some time we have lots of MCE in mcelog and we cant find out the
  reason.

 The only thing that shows there (when it shows, since sometimes it doesn't
 seem to) is a hardware error. You *WILL* be replacing hardware, sometime
 soon, like yesterday.

 Normal is not: *ANYTHING* here is Bad News. First, you've got DIMMs
 failing.  CPU 53, assuming this system doesn't have 53+ physical CPUs,
 means that you have x-core systems, so you need to divide by x, so that if
 it's a 12-core system with 6 physical chips, that would make it DIMM 8
 associated with that physical CPU.
 snip
  One more interesting thins is the following output:
  [root@zuno]# cat /var/log/mcelog |grep CPU|sort|awk '{print $2}'|uniq
  32
  33
  34
  35
  50
  51
  52
  53
 
  Those numbers are always the same.

 Bad news: you have *two* DIMMs failing, one associated with the physical
 CPU that has core 53, and another associated with the physical CPU that
 has cores 32-35.

 Talk to your OEM support to help identify which banks need replacing,
 and/or find a motherboard diagram.

          mark, who has to deal *again* with one machine with the same
 problem

 Tnx for the asnwer!

 Last night we'v made some research to find out which RAM modules bugged.

 To be noticed we have 8 modules 4G each.

 First  we'v removed a3,b1 slots for each cpu, and there were no changes in
 HW behaviour. Errors appeared after boot.

 Then we'v removed a1,a2 (yes i know that for hight performance we should
 place modules starting from a1 but it was our mistake and in any case server
 started) and ...and there were no errors during 1h. Usually we can observer
 errors coming ~every 5 mins.

 Then we'v placed back 2 modules. At that step we had a1,a3,b1 slots occupied
 for each cpu. No errors.

 Finally we'v placed last 2 modules...and no errors. It should be noticed
 that at that step we have exactly the same modules placement as before
 experiment.

 Sounds strange, but at first glance looks like smthg was wrong with modules
 placement. But we cant realise why the problem didnt show for the first
 days, even month of server running. Noone touched server HW, so i have no
 idea what was that.

 Now we are just waiting will there be errors again.

You know..

I once had a *whole rack* of blade servers, running CentOS, where
someone decided to save money by buying the memory separately and
replacing it in-house. Slews of memory errors started up pretty soon,
and I wound up having to reseat all of it, run some memory testing
tools against them, juggle the good memory with the bad memory to get
working systems, replace DIMMs, etc., etc. We kept seeing failures
over the next few months as part of the falling part of a bathtub
curve.

I was furious that we'd saved perhaps 2 thousand bucks on RAM,
overall, and completely burned a month of my time, made our clients
*VERY* unhappy, and came out looking like fools for not having this
very expensive piece of kit working from day one.

In the process, though, some of the systems were repaired
permanently by simply reseating the RAM. I did handle them
carefully, cleaning the filters, removing any dust (of which there was
very little, they were new) and checking all the cabling. I also
cleaned up the airflow a bit by doing some recabling and relabeling,
normal practice when I have a rack down and a chance to make sure
things go where they should.

And I *carefully* cleaned up the blood where I cut my hand on the heat
sink on the one system. Maybe it was the blood sacrifice that appeased
the gods on that server?


Re: [CentOS] Cant find out MCE reason (CPU 35 BANK 8)

2011-03-22 Thread m . roth
Vladimir Budnev wrote:
 2011/3/21 m.r...@5-cent.us
 Vladimir Budnev wrote:
  Hello community.
 
  We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with 2xIntel
  Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G
 
  For some time we have lots of MCE in mcelog and we cant find out the
  reason.

 The only thing that shows there (when it shows, since sometimes it
 doesn't seem to) is a hardware error. You *WILL* be replacing hardware,
sometime
 soon, like yesterday.
snip
 Bad news: you have *two* DIMMs failing, one associated with the physical
 CPU that has core 53, and another associated with the physical CPU that
 has cores 32-35.
snip
 Last night we'v made some research to find out which RAM modules bugged.

 To be noticed we have 8 modules 4G each.
snip
 Finally we'v placed last 2 modules...and no errors. It should be noticed
 that at that step we have exactly the same modules placement as before
 experiment.

 Sounds strange, but at first glance looks like smthg was wrong with
 modules placement. But we cant realise why the problem didnt show for
the first
 days, even month of server running. Noone touched server HW, so i have no
 idea what was that.

 Now we are just waiting will there be errors again.

I'm sure there will. Reseating the memory may have done something, but
there will, I'll wager.

Here's a question out of left field: who was the manufacturer of the 4G
DIMMs? Not Supermicro, but the DIMMs themselves?

mark



Re: [CentOS] Cant find out MCE reason (CPU 35 BANK 8)

2011-03-22 Thread Vladimir Budnev
2011/3/22 m.r...@5-cent.us

 Vladimir Budnev wrote:
  2011/3/21 m.r...@5-cent.us
  Vladimir Budnev wrote:
   Hello community.
  
   We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with 2xIntel
   Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G
  
   For some time we have lots of MCE in mcelog and we cant find out the
   reason.
 
  The only thing that shows there (when it shows, since sometimes it
  doesn't seem to) is a hardware error. You *WILL* be replacing hardware,
 sometime
  soon, like yesterday.
 snip
  Bad news: you have *two* DIMMs failing, one associated with the physical
  CPU that has core 53, and another associated with the physical CPU that
  has cores 32-35.
 snip
  Last night we'v made some research to find out which RAM modules bugged.
 
  To be noticed we have 8 modules 4G each.
 snip
  Finally we'v placed last 2 modules...and no errors. It should be noticed
  that at that step we have exactly the same modules placement as before
  experiment.
 
  Sounds strange, but at first glance looks like smthg was wrong with
  modules placement. But we cant realise why the problem didnt show for
 the first
  days, even month of server running. Noone touched server HW, so i have no
  idea what was that.
 
  Now we are just waiting will there be errors again.

 I'm sure there will. Reseating the memory may have done something, but
 there will, I'll wager.


mark, you are absolutely right :) Approximately 1 hour ago the errors
appeared. They have appeared only once since the reboot, but they are back.
Hi there :(

The good news is that the CPU numbers changed, so now we have CPU 1,2,3 and
18,19,20,21. We definitely moved the broken modules to other slots.
Anyway, a bad DIMM is really good news for us compared with, e.g., a bad
motherboard.

We are going to continue the party tonight or tomorrow morning, and determine
which two modules are broken.

Is it possible to determine which physical DIMMs correspond to the CPUs
reported in the MCE messages? We have two rows of slots (6 slots in each
row), one for CPU1 and the second for CPU2. The used slots are marked
cpu1-a1, cpu1-a2, cpu1-a3, cpu1-b1 and cpu2-a1, cpu2-a2, cpu2-a3, cpu2-b1.

I remember that you advised dividing the CPU number by the physical core
count. We have 2 quad-core processors, so 8 CPUs. 1/8=0. Is that the cpu-a1
slot, or does it depend on the situation? I hope we will find those bastards
ourselves, but a hint would be great.

And one more thing I can't understand... if there are, say, 8 CPU numbers
per memory module (in our situation), why do we see only 4 numbers and not
8, e.g. 0,1,2,3,4,5,6,7?


 Here's a question out of left field: who was the manufacturer of the 4G
 DIMMs? Not Supermicro, but the DIMMs themselves?


They are Kingston KVR1333D3D4R9S/4G, if I understood the question.


Re: [CentOS] Cant find out MCE reason (CPU 35 BANK 8)

2011-03-22 Thread m . roth
Vladimir Budnev wrote:
 2011/3/22 m.r...@5-cent.us
 Vladimir Budnev wrote:
  2011/3/21 m.r...@5-cent.us
  Vladimir Budnev wrote:
  
   We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with 2xIntel
   Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G
  
   For some time we have lots of MCE in mcelog and we cant find out
   the reason.
 
  The only thing that shows there (when it shows, since sometimes it
  doesn't seem to) is a hardware error. You *WILL* be replacing
  hardware, sometime soon, like yesterday.
 snip
  Bad news: you have *two* DIMMs failing, one associated with the
  physical CPU that has core 53, and another associated with the
physical CPU
  that has cores 32-35.
 snip, memory reseating
  Now we are just waiting will there be errors again.

 I'm sure there will. Reseating the memory may have done something, but
 there will, I'll wager.

 mark, you are absolutely right :) Approximetely 1h ago errors appeared.
 They appeared only once since reboot, but they r back. Hi there :(

 The good idea is that CPU numbers changed, so now we have cpu 1,2,3 and
 18,19,20,21.We definetely moved broken modules to another slots.
 Anyway bad dimm is really a good news for us instead of e.g.  motherboard.
snip
 Is it possible to determine which physical dimms correspond to those cpus
 noticed in mce messagees? We have two rows of slots(6 slot for each row)
 one for cpu1 and second for cpu2. Used slots marked as
 cpu1-a1,cpu1-a2,cpu1-a3,cpu1-b1 and cpu2-a1,cpu2-a2,cpu2-a3,cpu2-b1.

 I remeber that you adviced to divide cpu number on physical core count. We
 have 2 quad core proc, so 8 cpu. 1/8=0 Is it cpu-a1 slot or depends on
 situation? I hope we will find those bustards ourselvs but hint would be
 great.

 And one more thing i cant funderstand ... if there is,say, 8 cpu numbers
 per each memory module(in our situation), why we see only 4 numbers and
 not 8 e.g. 0,1,2,3,4,5,6,7 ?

I'm now confused about a lot: originally, you mentioned 53 - 57, was it?
That doesn't add up, since you say you have 2 quad core processors, for a
total of 8 cpus, and each of those processors have 6 banks, which would
mean each processor should only see six (directly). Where I'm confused is
how you could have cores 32-35, or 53-whatsit, when you only have 8 cores
in two processors.

 Here's a question out of left field: who was the manufacturer of the 4G
 DIMMs? Not Supermicro, but the DIMMs themselves?

 This is Kingston KVR1333D3D4R9S/4G if i got the question

Oh, ok. I was wondering if they were Hynix - I've seen a good number of
bad 4G and 8G DIMMs from them recently, and that across three different
OEMs and model DIMMs.

 mark



Re: [CentOS] Cant find out MCE reason (CPU 35 BANK 8)

2011-03-22 Thread Vladimir Budnev
2011/3/22 m.r...@5-cent.us

 Vladimir Budnev wrote:
  2011/3/22 m.r...@5-cent.us
  Vladimir Budnev wrote:
   2011/3/21 m.r...@5-cent.us
   Vladimir Budnev wrote:
   
We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with 2xIntel
Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G
   
For some time we have lots of MCE in mcelog and we cant find out
the reason.
  
   The only thing that shows there (when it shows, since sometimes it
   doesn't seem to) is a hardware error. You *WILL* be replacing
   hardware, sometime soon, like yesterday.
  snip
   Bad news: you have *two* DIMMs failing, one associated with the
   physical CPU that has core 53, and another associated with the
 physical CPU
   that has cores 32-35.
  snip, memory reseating
   Now we are just waiting will there be errors again.
 
  I'm sure there will. Reseating the memory may have done something, but
  there will, I'll wager.
 
  mark, you are absolutely right :) Approximetely 1h ago errors appeared.
  They appeared only once since reboot, but they r back. Hi there :(
 
  The good idea is that CPU numbers changed, so now we have cpu 1,2,3 and
  18,19,20,21.We definetely moved broken modules to another slots.
  Anyway bad dimm is really a good news for us instead of e.g.
  motherboard.
 snip
  Is it possible to determine which physical dimms correspond to those cpus
  noticed in mce messagees? We have two rows of slots(6 slot for each row)
  one for cpu1 and second for cpu2. Used slots marked as
  cpu1-a1,cpu1-a2,cpu1-a3,cpu1-b1 and cpu2-a1,cpu2-a2,cpu2-a3,cpu2-b1.
 
  I remeber that you adviced to divide cpu number on physical core count.
 We
  have 2 quad core proc, so 8 cpu. 1/8=0 Is it cpu-a1 slot or depends on
  situation? I hope we will find those bustards ourselvs but hint would be
  great.
 
  And one more thing i cant funderstand ... if there is,say, 8 cpu
 numbers
  per each memory module(in our situation), why we see only 4 numbers and
  not 8 e.g. 0,1,2,3,4,5,6,7 ?

 I'm now confused about a lot: originally, you mentioned 53 - 57, was it?
 That doesn't add up, since you say you have 2 quad core processors, for a
 total of 8 cpus, and each of those processors have 6 banks, which would
 mean each processor should only see six (directly). Where I'm confused is
 how you could have cores 32-35, or 53-whatsit, when you only have 8 cores
 in two processors.


2 CPUs, each with 8 cores and HT support. So 16 at most, I think. Is that OK,
counted this way?
I have really lost the thread with these CPU-to-memory-bank mappings...


  Here's a question out of left field: who was the manufacturer of the 4G
  DIMMs? Not Supermicro, but the DIMMs themselves?
 
  This is Kingston KVR1333D3D4R9S/4G if i got the question

 Oh, ok. I was wondering if they were Hynix - I've seen a good number of
 bad 4G and 8G DIMMs from them recently, and that across three different
 OEMs and model DIMMs.

 mark



Re: [CentOS] Cant find out MCE reason (CPU 35 BANK 8)

2011-03-22 Thread m . roth
Vladimir Budnev wrote:
 2011/3/22 m.r...@5-cent.us
 Vladimir Budnev wrote:
  2011/3/22 m.r...@5-cent.us
  Vladimir Budnev wrote:
   2011/3/21 m.r...@5-cent.us
   Vladimir Budnev wrote:
   
We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with
2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G
   
For some time we have lots of MCE in mcelog and we cant find out
the reason.
  
   The only thing that shows there (when it shows, since sometimes it
   doesn't seem to) is a hardware error. You *WILL* be replacing
   hardware, sometime soon, like yesterday.
  snip
  We have 2 quad core proc, so 8 cpu. 1/8=0 Is it cpu-a1 slot or
depends on
  situation? I hope we will find those bustards ourselvs but hint would
  be great.
 
  And one more thing i cant funderstand ... if there is,say, 8 cpu
  numbers per each memory module(in our situation), why we see only 4
numbers
  and not 8 e.g. 0,1,2,3,4,5,6,7 ?

 I'm now confused about a lot: originally, you mentioned 53 - 57, was it?
 That doesn't add up, since you say you have 2 quad core processors, for
 a total of 8 cpus, and each of those processors have 6 banks, which would
 mean each processor should only see six (directly). Where I'm confused
 is how you could have cores 32-35, or 53-whatsit, when you only have 8
 cores in two processors.

  2 cpu each 8 cores and HT support. So 16 at max i think. for such way is
 it  ok?

Huh? Above, you say 2 quad core proc - that's 8 cores over two processor
chips. HT support doesn't figure into it; if you use dmidecode or lshw, I
believe it will show you 8 cores, not 16.
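
One way to check the socket, core, and logical CPU counts without dmidecode
is to read /proc/cpuinfo. A minimal sketch in Python, assuming the kernel
exposes the `physical id` and `core id` fields (older kernels may not); the
function name is made up for illustration and the sample text is synthetic,
not taken from the server in this thread:

```python
def summarize_cpuinfo(text):
    """Count sockets, physical cores, and logical CPUs in /proc/cpuinfo text."""
    sockets = set()
    cores = set()
    logical = 0
    # /proc/cpuinfo separates logical CPUs with a blank line.
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue
        logical += 1
        socket = fields.get("physical id")
        core = fields.get("core id")
        if socket is not None:
            sockets.add(socket)
            if core is not None:
                # A physical core is identified by (socket, core id).
                cores.add((socket, core))
    return {"sockets": len(sockets), "cores": len(cores), "logical": logical}


# Synthetic sample (assumption): 2 sockets x 4 cores x 2 threads.
SAMPLE = "\n\n".join(
    f"processor\t: {n}\nphysical id\t: {n % 2}\ncore id\t: {(n // 2) % 4}"
    for n in range(16)
)

print(summarize_cpuinfo(SAMPLE))
```

On a real machine the text would come from `open("/proc/cpuinfo").read()`.
With Hyper-Threading enabled the logical count is twice the core count, so
the two numbers should not be confused.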

  I really lost the idea line with those cpu to memory bank mappings...

Each processor directly sees the DIMMs associated with it, so the
banks associated with each processor are what directly affect its
cores. So, if you see something like
Mar 20 05:01:35 system name kernel:  Northbridge Error, node 0, core: 5
(these processors are 8-core), it means that one of the DIMMs in bank 0,
0-3, is bad.
You should see
 __
|_0|  0 1 2 3
 __
|_1|  0 1 2 3

or whatever on the m/b, so one of the top ones there is affected. Is that
any clearer?

   mark



Re: [CentOS] Cant find out MCE reason (CPU 35 BANK 8)

2011-03-22 Thread Vladimir Budnev
2011/3/22 m.r...@5-cent.us

 Vladimir Budnev wrote:
  2011/3/22 m.r...@5-cent.us
  Vladimir Budnev wrote:
   2011/3/22 m.r...@5-cent.us
   Vladimir Budnev wrote:
2011/3/21 m.r...@5-cent.us
Vladimir Budnev wrote:

 We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with
 2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G

 For some time we have lots of MCE in mcelog and we cant find out
 the reason.
   
The only thing that shows there (when it shows, since sometimes it
doesn't seem to) is a hardware error. You *WILL* be replacing
hardware, sometime soon, like yesterday.
   snip
   We have 2 quad core proc, so 8 cpu. 1/8=0 Is it cpu-a1 slot or
 depends on
   situation? I hope we will find those bustards ourselvs but hint would
   be great.
  
   And one more thing i cant funderstand ... if there is,say, 8 cpu
   numbers per each memory module(in our situation), why we see only 4
 numbers
   and not 8 e.g. 0,1,2,3,4,5,6,7 ?
 
  I'm now confused about a lot: originally, you mentioned 53 - 57, was it?
  That doesn't add up, since you say you have 2 quad core processors, for
  a total of 8 cpus, and each of those processors have 6 banks, which
 would
  mean each processor should only see six (directly). Where I'm confused
  is how you could have cores 32-35, or 53-whatsit, when you only have 8
  cores in two processors.
 
   2 cpu each 8 cores and HT support. So 16 at max i think. for such way is
  it  ok?

 Huh? Above, you say 2 quad core proc - that's 8 cores over two processor
 chips. HT support doesn't figure into it; if you use dmidecode or lshw, I
 believe it will show you 8 cores, not 16.

That was a typo, sorry. 2 CPUs, each with 4 cores, so 8 cores in total.


   I really lost the idea line with those cpu to memory bank mappings...

 Each processor will directly see the DIMMs associate with it, so that the
 banks associated with each processor will be what directly affects the
 cores. So, if you see something like
 Mar 20 05:01:35 system name kernel:  Northbridge Error, node 0, core: 5
 (these processors are 8-core), it means that one of the DIMMs in bank 0,
 0-3, is bad.
 You should see
   __
  |_0|  0 1 2 3
 __
|_1|  0 1 2 3

 or whatever on the m/b, so one of the top ones there is affected. Is that
 any clearer?

First of all, big thanks for helping, mark.

In your example everything is clear. But I am lost with what we have.
Previously we received messages like the one I posted in the first mail:
CPU 51 BANK 8 TSC 8511e3ca77dc
MISC 274d587f6141 ADDR 807044840
STATUS cc005501009f MCGSTATU

And the CPU numbers were always the same. I really don't know why mcelog
shows such numbers, but that's what we have. It was always BANK 8, with the
numbers 32,33,34,35 and 50,51,52,53 in the CPU field.
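
The low 16 bits of the STATUS value in a record like the one above can be
decoded by hand. A minimal sketch in Python, assuming the last four hex
digits of the archived STATUS values survived intact (the values in this
thread look shorter than a full 64-bit register, so the upper flag bits are
not decoded here), and using the memory-controller error-code layout from
the Intel Software Developer's Manual (0000 0000 1MMM CCCC). The function
name is made up for illustration:

```python
# MMM field of a memory-controller error code (bits 6:4).
REQUEST_TYPES = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}


def decode_memory_controller_code(status_hex):
    """Decode the low 16 bits of an MCi_STATUS value given as a hex string.

    Returns None when the code is not a memory-controller error code.
    """
    code = int(status_hex, 16) & 0xFFFF
    # Memory-controller codes have the bit pattern 0000 0000 1MMM CCCC.
    if code & 0xFF80 != 0x0080:
        return None
    request = (code >> 4) & 0b111
    channel = code & 0b1111
    return {
        "request": REQUEST_TYPES.get(request, "reserved"),
        # CCCC = 1111 means the channel is not specified.
        "channel": None if channel == 0b1111 else channel,
    }


print(decode_memory_controller_code("cc005501009f"))
```

For the STATUS values posted in this thread the low bits are 009f, which
under these assumptions reads as a memory read error with no channel
specified. That fits a DIMM problem but does not name a slot.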

You persuaded us that it is a DIMM problem, and we decided to do a little
investigation, which I described earlier in the thread. During that we
swapped DIMM modules between slots, so now we have BANK 8 and CPU 1,2,3 and
18,19,20,21. It really seems that those numbers are somehow connected with
the RAM modules.

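But... as I said, we have the following slots:
   CPU1    cpu1-a1 cpu1-a2 cpu1-a3 cpu1-b1 cpu1-b2 cpu1-b3
   CPU2    cpu2-a1 cpu2-a2 cpu2-a3 cpu2-b1 cpu2-b2 cpu2-b3

We have the modules placed this way (V = populated):
+------+---------+---------+---------+---------+---------+---------+
|      |    V    |    V    |    V    |    V    |  free   |  free   |
+------+---------+---------+---------+---------+---------+---------+
| CPU1 | cpu1-a1 | cpu1-a2 | cpu1-a3 | cpu1-b1 | cpu1-b2 | cpu1-b3 |
+------+---------+---------+---------+---------+---------+---------+

+------+---------+---------+---------+---------+---------+---------+
|      |    V    |    V    |    V    |    V    |  free   |  free   |
+------+---------+---------+---------+---------+---------+---------+
| CPU2 | cpu2-a1 | cpu2-a2 | cpu2-a3 | cpu2-b1 | cpu2-b2 | cpu2-b3 |
+------+---------+---------+---------+---------+---------+---------+

There is definitely something going on with the memory banks, because
swapping modules changed the MCE messages, but what exactly... or have I
interpreted it all wrong?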


Re: [CentOS] Cant find out MCE reason (CPU 35 BANK 8)

2011-03-22 Thread Rafa Griman
Hi :)

On Tue, Mar 22, 2011 at 3:59 PM, Vladimir Budnev
vladimir.bud...@gmail.com wrote:

[...]

 But... as i sad we have following slots
    CPU1    cpu1-a1 cpu1-a2 cpu1-a3 cpu1-b1 cpu1-b2 cpu1-b3
    CPU2    cpu2-a1 cpu2-a2 cpu2-a3 cpu2-b1 cpu2-b2 cpu2-b3

 We have modules placed in such way:
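 +------+---------+---------+---------+---------+---------+---------+
 |      |    V    |    V    |    V    |    V    |  free   |  free   |
 +------+---------+---------+---------+---------+---------+---------+
 | CPU1 | cpu1-a1 | cpu1-a2 | cpu1-a3 | cpu1-b1 | cpu1-b2 | cpu1-b3 |
 +------+---------+---------+---------+---------+---------+---------+

 +------+---------+---------+---------+---------+---------+---------+
 |      |    V    |    V    |    V    |    V    |  free   |  free   |
 +------+---------+---------+---------+---------+---------+---------+
 | CPU2 | cpu2-a1 | cpu2-a2 | cpu2-a3 | cpu2-b1 | cpu2-b2 | cpu2-b3 |
 +------+---------+---------+---------+---------+---------+---------+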

 Definetely there is something with memory banks,becasue replacinbg moudels
 changed the mce messages, but what exactly...or iv interpreted all wrong?


This isn't an optimal setup (performance-wise). You should always
populate complete slots in multiples of 3 to get the full bandwidth.
In your case, you've got cpu1-b[2|3] and cpu2-b[2|3] with no DIMMs so
that would affect your performance.

HTH

   Rafa


Re: [CentOS] Cant find out MCE reason (CPU 35 BANK 8)

2011-03-22 Thread m . roth
Vladimir Budnev wrote:
 2011/3/22 m.r...@5-cent.us
 Vladimir Budnev wrote:
  2011/3/22 m.r...@5-cent.us
  Vladimir Budnev wrote:
   2011/3/22 m.r...@5-cent.us
   Vladimir Budnev wrote:
2011/3/21 m.r...@5-cent.us
Vladimir Budnev wrote:

 We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with
 2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G

The next thing you should do, if you don't have them, is go to
http://www.supermicro.com/support/manuals/ and d/l the manual, and see
what it says about DIMMs.

mark



Re: [CentOS] Cant find out MCE reason (CPU 35 BANK 8)

2011-03-22 Thread Vladimir Budnev
2011/3/22 m.r...@5-cent.us

 Vladimir Budnev wrote:
  2011/3/22 m.r...@5-cent.us
  Vladimir Budnev wrote:
   2011/3/22 m.r...@5-cent.us
   Vladimir Budnev wrote:
2011/3/22 m.r...@5-cent.us
Vladimir Budnev wrote:
 2011/3/21 m.r...@5-cent.us
 Vladimir Budnev wrote:
 
  We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with
  2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G
 
 The next thing you should do, if you don't have them, is go to
 http://www.supermicro.com/support/manuals/ and d/l the manual, and see
 what it says about DIMMs.


If you meant checking whether those DIMM modules are compatible with the
motherboard, that is OK. Kingston KVR1333D3D4R9S is in the tested list:
http://www.supermicro.com/support/resources/memory/display.cfm?sz=4.0mspd=1.333mtyp=33id=89A8A9B9E45453813BB99586F1BAE93F

And can you say something about the wild CPU numbers and about determining
which DIMMs are faulty? Didn't you mean, some posts ago, that on an x-core
system we must divide the CPU value by the number of cores to get the DIMM
slot? E.g. CPU 32 / 8 cores -> slot 4?

At the moment we have removed 2 modules and are monitoring the result.


Re: [CentOS] Cant find out MCE reason (CPU 35 BANK 8)

2011-03-22 Thread m . roth
Vladimir Budnev wrote:
 2011/3/22 m.r...@5-cent.us

 Vladimir Budnev wrote:
  2011/3/22 m.r...@5-cent.us
  Vladimir Budnev wrote:
   2011/3/22 m.r...@5-cent.us
   Vladimir Budnev wrote:
2011/3/22 m.r...@5-cent.us
Vladimir Budnev wrote:
 2011/3/21 m.r...@5-cent.us
 Vladimir Budnev wrote:
 
  We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF
 with
  2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G
 
 The next thing you should do, if you don't have them, is go to
 http://www.supermicro.com/support/manuals/ and d/l the manual, and see
 what it says about DIMMs.

 If you meaned to check whether those DIMM modules a compatible with mother
 board , its ok. Kingstin KVR1333D3D4R9S is in tested list
 http://www.supermicro.com/support/resources/memory/display.cfm?sz=4.0mspd=1.333mtyp=33id=89A8A9B9E45453813BB99586F1BAE93F

No, what you need to see is a) whether what you did was valid (for the
Supermicro m/b on the server I'm working on right now, the manual says the
a-banks must *ALWAYS* be populated...), and b) you might find some
troubleshooting info to help you identify which DIMMs are the problem.

 And can you say something about cpu wild numbers and determing which dimms
 are bugged? didnt you mean some post ago that on x core system we must
 divide cpu value on core numbers to get DIMM slot? e.g. CPU 32/8 cores -4
 slot?

Nope. From your original post:
   One more interesting thins is the following output:
  [root@zuno]# cat /var/log/mcelog |grep CPU|sort|awk '{print $2}'|uniq
  32
  33
  34
  35
  50
  51
  52
  53

So with 2 4-core Xeons, I don't understand how you can get 3x and 5x.
Could you post some raw messages, either from /var/log/messages or from
/var/log/mcelog?

mark



Re: [CentOS] Cant find out MCE reason (CPU 35 BANK 8)

2011-03-22 Thread Vladimir Budnev
2011/3/22 m.r...@5-cent.us

 Vladimir Budnev wrote:
  2011/3/22 m.r...@5-cent.us
 
  Vladimir Budnev wrote:
   2011/3/22 m.r...@5-cent.us
   Vladimir Budnev wrote:
2011/3/22 m.r...@5-cent.us
Vladimir Budnev wrote:
 2011/3/22 m.r...@5-cent.us
 Vladimir Budnev wrote:
  2011/3/21 m.r...@5-cent.us
  Vladimir Budnev wrote:
  
   We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF
  with
   2xIntel Xeon E5630 and 8xKingston KVR1333D3D4R9S/4G
  
  The next thing you should do, if you don't have them, is go to
  http://www.supermicro.com/support/manuals/ and d/l the manual, and
 see
  what it says about DIMMs.
 
  If you meaned to check whether those DIMM modules a compatible with
 mother
  board , its ok. Kingstin KVR1333D3D4R9S is in tested list
 
 http://www.supermicro.com/support/resources/memory/display.cfm?sz=4.0mspd=1.333mtyp=33id=89A8A9B9E45453813BB99586F1BAE93F
 
 No, what you need to see is a) whether what you did was valid (for the
 Supermicro m/b on the server I'm working on right now, the manual says the
 a-banks must *ALWAYS* be populated...), and b) you might find some
 troubleshooting info to help you identify which DIMMs are the problem.


Roger that. Our bad :(


  And can you say something about cpu wild numbers and determing which
 dimms
  are bugged? didnt you mean some post ago that on x core system we must
  divide cpu value on core numbers to get DIMM slot? e.g. CPU 32/8 cores
 -4
  slot?

 Nope. From your original post:
One more interesting thins is the following output:
   [root@zuno]# cat /var/log/mcelog |grep CPU|sort|awk '{print $2}'|uniq
   32
   33
   34
   35
   50
   51
   52
   53

 So with 2 4-core Xeons, I don't understand how you can get 3x and 5x.
 Could you post some raw messages, either from /var/log/message or from
 /var/log/mcelog?


Sure, here they are, from before the night party:
MCE 24
CPU 52 BANK 8 TSC 372a290717a
MISC 68651f81186 ADDR 7dd2ad840
STATUS cc000281009f MCGSTATUS 0
MCE 25
CPU 32 BANK 8 TSC 372a29073cb
MISC 68651f81186 ADDR 7dd2ad840
STATUS cc000281009f MCGSTATUS 0
MCE 26
CPU 50 BANK 8 TSC 372a29064ca
MISC 68651f81186 ADDR 7dd2ad840
STATUS cc000281009f MCGSTATUS 0
MCE 27
CPU 33 BANK 8 TSC 372a2907e5c
MISC 68651f81186 ADDR 7dd2ad840
STATUS cc000281009f MCGSTATUS 0
MCE 28
CPU 35 BANK 8 TSC 372a29088f1
MISC 68651f81186 ADDR 7dd2ad840
STATUS cc000281009f MCGSTATUS 0
MCE 29
CPU 53 BANK 8 TSC 372a2908e82
MISC 68651f81186 ADDR 7dd2ad840
STATUS cc000281009f MCGSTATUS 0
MCE 30
CPU 51 BANK 8 TSC 372a290899f
MISC 68651f81186 ADDR 7dd2ad840
STATUS cc000281009f MCGSTATUS 0
MCE 31
CPU 34 BANK 8 TSC 423243c7aa5
MISC 2275a96d098f ADDR 7e7540ac0
STATUS cc001f01009f MCGSTATUS 0


And here are the ones from after it:

MCE 0
CPU 18 BANK 8 TSC 608709adcc62
MISC c6673a041181 ADDR 2f4cf4f40
STATUS cc81009f MCGSTATUS 0
MCE 1
CPU 2 BANK 8 TSC 608709adcbcb
MISC c6673a041181 ADDR 2f4cf4f40
STATUS cc81009f MCGSTATUS 0
MCE 2
CPU 20 BANK 8 TSC 608709adcb59
MISC c6673a041181 ADDR 2f4cf4f40
STATUS cc81009f MCGSTATUS 0
MCE 3
CPU 1 BANK 8 TSC 608709add9b0
MISC c6673a041181 ADDR 2f4cf4f40
STATUS cc81009f MCGSTATUS 0
MCE 4
CPU 3 BANK 8 TSC 608709ade3ab
MISC c6673a041181 ADDR 2f4cf4f40
STATUS cc81009f MCGSTATUS 0
MCE 5
CPU 19 BANK 8 TSC 608709ade850
MISC c6673a041181 ADDR 2f4cf4f40
STATUS cc81009f MCGSTATUS 0
MCE 6
CPU 21 BANK 8 TSC 608709ade4ea
MISC c6673a041181 ADDR 2f4cf4f40
STATUS cc81009f MCGSTATUS 0
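
One way to read the two dumps above is to group the records by ADDR instead of by CPU. The following is a minimal Python sketch, not part of mcelog: the function name group_by_addr and the inline sample are illustrative, and it assumes the records keep exactly the CPU/MISC/STATUS line layout shown above.

```python
import re
from collections import defaultdict

# Assumed layout: a "CPU <n> BANK <n> TSC <hex>" line followed by
# a "MISC <hex> ADDR <hex>" line, as in the raw records above.
CPU_RE = re.compile(r"^\s*CPU (\d+) BANK (\d+)")
ADDR_RE = re.compile(r"^\s*MISC [0-9a-f]+ ADDR ([0-9a-f]+)")


def group_by_addr(text):
    """Map each reported ADDR to the sorted list of CPU numbers that logged it."""
    cpus_by_addr = defaultdict(set)
    cpu = None
    for line in text.splitlines():
        m = CPU_RE.match(line)
        if m:
            cpu = int(m.group(1))
            continue
        m = ADDR_RE.match(line)
        if m and cpu is not None:
            cpus_by_addr[m.group(1)].add(cpu)
            cpu = None
    return {addr: sorted(cpus) for addr, cpus in cpus_by_addr.items()}


# Three records copied from the dump above (MCE 29, 30 and 31).
SAMPLE = """\
MCE 29
CPU 53 BANK 8 TSC 372a2908e82
MISC 68651f81186 ADDR 7dd2ad840
STATUS cc000281009f MCGSTATUS 0
MCE 30
CPU 51 BANK 8 TSC 372a290899f
MISC 68651f81186 ADDR 7dd2ad840
STATUS cc000281009f MCGSTATUS 0
MCE 31
CPU 34 BANK 8 TSC 423243c7aa5
MISC 2275a96d098f ADDR 7e7540ac0
STATUS cc001f01009f MCGSTATUS 0
"""

if __name__ == "__main__":
    for addr, cpus in group_by_addr(SAMPLE).items():
        print(addr, cpus)
```

In the records above, MCE 24 to 30 all carry ADDR 7dd2ad840 and MCE 0 to 6 all carry ADDR 2f4cf4f40, so each burst looks like one memory location reported by several CPUs rather than several independent faults.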
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Cant find out MCE reason (CPU 35 BANK 8)

2011-03-22 Thread m . roth
Vladimir Budnev wrote:
 2011/3/22 m.r...@5-cent.us
 Vladimir Budnev wrote:
  2011/3/21 m.r...@5-cent.us
  Vladimir Budnev wrote:
  
   We are running CentOS 4.8 on a SuperMicro SYS-6026T-3RF
   with 2x Intel Xeon E5630 and 8x Kingston KVR1333D3D4R9S/4G
  
  The next thing you should do, if you don't have them, is go to
  http://www.supermicro.com/support/manuals/ and d/l the manual, and
  see what it says about DIMMs.
 
  If you meant checking whether those DIMM modules are compatible with the
  motherboard, that's OK. Kingston KVR1333D3D4R9S is in the tested list:
 
 http://www.supermicro.com/support/resources/memory/display.cfm?sz=4.0mspd=1.333mtyp=33id=89A8A9B9E45453813BB99586F1BAE93F
 
 No, what you need to see is a) whether what you did was valid (for the
 Supermicro m/b on the server I'm working on right now, the manual says
 the a-banks must *ALWAYS* be populated...), and b) you might find some
 troubleshooting info to help you identify which DIMMs are the problem.

 Roger that. Our bad :(

Std. sysadmin reply: RTFM! <g>

  And can you say something about the wild CPU numbers and about determining
  which DIMMs are bad? Didn't you say a few posts ago that on an x-core system
  we must divide the CPU value by the number of cores to get the DIMM slot?
  E.g. CPU 32 / 8 cores -> slot 4?
snip
 So with 2 4-core Xeons, I don't understand how you can get 3x and 5x.
 Could you post some raw messages, either from /var/log/messages or from
 /var/log/mcelog?


 Sure, here they are. From before the night session:
 MCE 24
 CPU 52 BANK 8 TSC 372a290717a
 MISC 68651f81186 ADDR 7dd2ad840
 STATUS cc000281009f MCGSTATUS 0
 MCE 25
snip
At this point, I throw up my hands. I have *no* idea how they could get
numbers like CPU 52, unless something's wrong in the o/s - I mean, you are
running 64 bit, right?

  mark



Re: [CentOS] Cant find out MCE reason (CPU 35 BANK 8)

2011-03-22 Thread Vladimir Budnev
2011/3/22 m.r...@5-cent.us

 Vladimir Budnev wrote:
  2011/3/22 m.r...@5-cent.us
  Vladimir Budnev wrote:
   2011/3/21 m.r...@5-cent.us
   Vladimir Budnev wrote:
   
    We are running CentOS 4.8 on a SuperMicro SYS-6026T-3RF
    with 2x Intel Xeon E5630 and 8x Kingston KVR1333D3D4R9S/4G
   
   The next thing you should do, if you don't have them, is go to
   http://www.supermicro.com/support/manuals/ and d/l the manual, and
   see what it says about DIMMs.
  
   If you meant checking whether those DIMM modules are compatible with the
   motherboard, that's OK. Kingston KVR1333D3D4R9S is in the tested list:
  
 
 http://www.supermicro.com/support/resources/memory/display.cfm?sz=4.0mspd=1.333mtyp=33id=89A8A9B9E45453813BB99586F1BAE93F
  
  No, what you need to see is a) whether what you did was valid (for the
  Supermicro m/b on the server I'm working on right now, the manual says
  the a-banks must *ALWAYS* be populated...), and b) you might find some
  troubleshooting info to help you identify which DIMMs are the problem.
 
  Roger that. Our bad :(

 Std. sysadmin reply: RTFM! <g>
 
   And can you say something about the wild CPU numbers and about determining
   which DIMMs are bad? Didn't you say a few posts ago that on an x-core
   system we must divide the CPU value by the number of cores to get the DIMM
   slot? E.g. CPU 32 / 8 cores -> slot 4?
 snip
  So with 2 4-core Xeons, I don't understand how you can get 3x and 5x.
  Could you post some raw messages, either from /var/log/messages or from
  /var/log/mcelog?
 
 
  Sure, here they are. From before the night session:
  MCE 24
  CPU 52 BANK 8 TSC 372a290717a
  MISC 68651f81186 ADDR 7dd2ad840
  STATUS cc000281009f MCGSTATUS 0
  MCE 25
 snip
 At this point, I throw up my hands. I have *no* idea how they could get
 numbers like CPU 52, unless something's wrong in the o/s - I mean, you are
 running 64 bit, right?


Yeah, x86_64.
I have an idea, though I'm not sure: the thing is, we are running CentOS 4.8.
It is old, and its mcelog version is old too, so maybe it decodes something
completely wrong.
Anyway, thanks so much for your time and answers. I hope we will find the bad
DIMMs by experiment.


Re: [CentOS] Cant find out MCE reason (CPU 35 BANK 8)

2011-03-22 Thread m . roth
Vladimir Budnev wrote:
 2011/3/22 m.r...@5-cent.us
CHOMP
  So with 2 4-core Xeons, I don't understand how you can get 3x and 5x.
  Could you post some raw messages, either from /var/log/messages or
  from /var/log/mcelog?
 
  Sure, here they are. From before the night session:
  MCE 24
  CPU 52 BANK 8 TSC 372a290717a
  MISC 68651f81186 ADDR 7dd2ad840
  STATUS cc000281009f MCGSTATUS 0
  MCE 25
 snip
 At this point, I throw up my hands. I have *no* idea how they could get
 numbers like CPU 52, unless something's wrong in the o/s - I mean, you
 are running 64 bit, right?

 Yeah, x86_64.
 I have an idea, though I'm not sure: the thing is, we are running CentOS 4.8.
 It is old, and its mcelog version is old too, so maybe it decodes something
 completely wrong.

It could be that 4.8 doesn't really understand the CPU.

 Anyway, thanks so much for your time and answers. I hope we will find the bad
 DIMMs by experiment.

Seriously - how old is this? I think you should call your vendor: some
will give you phone or email support, even after the end of warranty.

 mark



Re: [CentOS] Cant find out MCE reason (CPU 35 BANK 8)

2011-03-22 Thread Charles Polisher
m.r...@5-cent.us wrote:
 Vladimir Budnev wrote:
  2011/3/22 m.r...@5-cent.us
  Vladimir Budnev wrote:
   2011/3/21 m.r...@5-cent.us
   Vladimir Budnev wrote:
  snip, memory reseating
   Now we are just waiting to see whether the errors come back.
 
  I'm sure there will. Reseating the memory may have done something, but
  there will, I'll wager.
 
  mark, you are absolutely right :) Approximately 1h ago the errors appeared.
  They have appeared only once since the reboot, but they are back. Hi there :(

Here's a guess why you're having this problem:
http://lmgtfy.com/?q=RAM+latent+junction+failure
I suspect you're going to have problems again in a month or so.
I hope I'm wrong.
-- 
Charles Polisher



[CentOS] Cant find out MCE reason (CPU 35 BANK 8)

2011-03-21 Thread Vladimir Budnev
Hello community.

We are running CentOS 4.8 on a SuperMicro SYS-6026T-3RF with 2x Intel Xeon
E5630 and 8x Kingston KVR1333D3D4R9S/4G.

For some time we have been getting lots of MCEs in mcelog and we can't find
the cause. An ordinary MCE message looks like:

CPU 51 BANK 8 TSC 8511e3ca77dc
MISC 274d587f6141 ADDR 807044840
STATUS cc005501009f MCGSTATUS 0

Decoded with mcelog --ascii --cpu p4 (because there is no xeon56xx in the list):

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 53 BANK 8 TSC 1982d8f72b1f
MISC e1742eac6242 ADDR 7ffd78a80
MCG status:
MCi status:
Error overflow
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Transaction: Memory read error
STATUS cc000201009f MCGSTATUS 0
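
The "MEMORY CONTROLLER RD_CHANNELunspecified_ERR" line can be cross-checked by hand from the low 16 bits of STATUS. The following is a minimal Python sketch under stated assumptions: it relies on the memory-controller error layout described in the Intel SDM (0000 0000 1MMM CCCC), the function name is made up for illustration, and it deliberately ignores the upper flag bits because the STATUS values in this thread appear shorter than the 16 hex digits of a full 64-bit register.

```python
# Memory-controller error layout in the low 16 bits of MCi_STATUS,
# as described in the Intel SDM: 0000 0000 1MMM CCCC.
TRANSACTIONS = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}


def decode_memory_controller_code(status_hex):
    """Decode the MCA error code (low 16 bits) of a STATUS value.

    Returns (transaction, channel), or None if the code does not match
    the memory-controller pattern. channel is None when unspecified.
    """
    code = int(status_hex, 16) & 0xFFFF
    if code & 0xFF80 != 0x0080:  # must look like 0000 0000 1xxx xxxx
        return None
    mmm = (code >> 4) & 0b111    # transaction type
    cccc = code & 0b1111         # channel number, 1111 = not specified
    transaction = TRANSACTIONS.get(mmm, "reserved")
    channel = None if cccc == 0b1111 else cccc
    return transaction, channel


if __name__ == "__main__":
    # STATUS value taken from the decoded record above.
    print(decode_memory_controller_code("cc000201009f"))
```

Every STATUS value in the records here ends in 009f, which under that layout is a memory read error with no channel specified, so the error code alone does not narrow the fault down to a channel or DIMM.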

The main question: is it possible to find out exactly which piece of hardware
causes those messages?
First we thought that, according to

/* A machine check record */
struct mce {
        __u64 status;     /* bank status register */
        __u64 misc;       /* misc register (always 0 right now) */
        __u64 addr;       /* address or 0 */
        __u64 mcgstatus;  /* global MC status register */
        __u64 rip;        /* Program counter or 0 for silent error */
        __u64 tsc;        /* cpu time stamp counter */
        __u64 res1;       /* for future extension */
        __u64 res2;       /* dito. */
        __u8 cs;          /* code segment */
        __u8 bank;        /* machine check bank */
        __u8 cpu;         /* cpu that raised the error */
        __u8 finished;    /* entry is valid */
        __u32 pad;
};

cpu is the CPU that raised the exception. But we have two quad-core CPUs with
HT, so there should be at most 16 logical CPUs, and in the logs we see 53 etc.
So now we are not sure what the cpu value is :) Does anyone know what the CPU
number means exactly?

One more interesting thing is the following output:
[root@zuno]# cat /var/log/mcelog |grep CPU|sort|awk '{print $2}'|uniq
32
33
34
35
50
51
52
53

Those numbers are always the same.
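
To see how often each CPU number appears, and not only which numbers appear, the records can be counted per CPU and bank. This is a minimal Python sketch under stated assumptions: the helper name is made up, it assumes raw records whose lines start with "CPU <n> BANK <n>" as shown above, and the sample lines are illustrative rather than copied from the log.

```python
import re
from collections import Counter

# Assumed raw-record layout: lines starting with "CPU <n> BANK <n>".
CPU_LINE = re.compile(r"^CPU (\d+) BANK (\d+)")


def count_by_cpu_and_bank(lines):
    """Count raw MCE records per (cpu, bank) pair."""
    counts = Counter()
    for line in lines:
        m = CPU_LINE.match(line)
        if m:
            counts[(int(m.group(1)), int(m.group(2)))] += 1
    return counts


# Illustrative input only; real use would pass the lines of the log file.
SAMPLE_LINES = [
    "CPU 51 BANK 8 TSC 8511e3ca77dc",
    "MISC 274d587f6141 ADDR 807044840",
    "STATUS cc005501009f MCGSTATUS 0",
    "CPU 32 BANK 8 TSC 372a29073cb",
    "CPU 51 BANK 8 TSC 372a290899f",
]

if __name__ == "__main__":
    # Sorting the (cpu, bank) tuples gives numeric order, unlike plain `sort`.
    for (cpu, bank), n in sorted(count_by_cpu_and_bank(SAMPLE_LINES).items()):
        print(f"CPU {cpu:3d} BANK {bank}: {n} record(s)")
```

Passing an open file object for the log instead of SAMPLE_LINES works the same way, since the function only iterates over lines.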

OK. Suppose we have a RAM problem. Since I don't really know what those CPU
numbers mean, we assume that CPU + bank can point to the faulty hardware. Is
that possible?
Following our broken-RAM theory, we suppose that the numbers 32, 33, 34, 35
and 50, 51, 52, 53 indicate some symmetric problem with the RAM, the slots, or
something else. Is that correct?

Thanks in advance.


Re: [CentOS] Cant find out MCE reason (CPU 35 BANK 8)

2011-03-21 Thread m . roth
Vladimir Budnev wrote:
 Hello community.

 We are running CentOS 4.8 on a SuperMicro SYS-6026T-3RF with 2x Intel Xeon
 E5630 and 8x Kingston KVR1333D3D4R9S/4G.

 For some time we have been getting lots of MCEs in mcelog and we can't find
 the cause.

The only thing that shows there (when it shows, since sometimes it doesn't
seem to) is a hardware error. You *WILL* be replacing hardware, sometime
soon, like yesterday.

There is no "ordinary" here: *ANYTHING* in that log is Bad News. First, you've
got DIMMs failing. CPU 53, assuming this system doesn't have 53+ physical
CPUs, means that you have x-core chips, so you need to divide by x. So if
it's a 12-core system with 6 physical chips, that would make it DIMM 8
associated with that physical CPU.
snip
 One more interesting thing is the following output:
 [root@zuno]# cat /var/log/mcelog |grep CPU|sort|awk '{print $2}'|uniq
 32
 33
 34
 35
 50
 51
 52
 53

 Those numbers are always the same.

Bad news: you have *two* DIMMs failing, one associated with the physical
CPU that has core 53, and another associated with the physical CPU that
has cores 32-35.

Talk to your OEM support to help identify which banks need replacing,
and/or find a motherboard diagram.

  mark, who has to deal *again* with one machine with the same problem
