Hi Martin,

Thank you for the feedback. I will update the web page, and also inform the sys-admin I was working with. He was very curious why it wasn't working on other machines.

Would you like me to put your name on the web page with this information? I think I should, but I don't want to put people's names where they don't want them.

Troy

Martin Flemming wrote:
Hi, Troy et al.!


Good news from the hardware front ...

I've found the solution at
http://www.sun.com/products-n-solutions/hardware/docs/html/819-4347-14/software.html#58439

RHEL4 NMI Watchdog Timer Must Be Disabled In Servers With BIOS 38 (6486170)

The Non-Maskable Interrupt (NMI) Watchdog in RHEL4 is a mechanism used by software and hardware developers to detect system lockups during development. The NMI Watchdog periodically checks the CPU status to determine if a program is holding the CPU in an interrupted state for an extended period of time.

It has been observed in servers running BIOS 38 that the SMP kernel in RHEL4 will not boot without crashing when the NMI watchdog is enabled. If the watchdog timer is disabled, the server running RHEL4 will boot with no problems.

Workaround

Disable the watchdog timer on RHEL4 by performing the following steps:

1. Log in as superuser (root).

2. Edit the /boot/grub/menu.lst file.

3. At the end of each line that begins with kernel, append this text:

nmi_watchdog=0s

4. Save the changes to the file.

5. Reboot the system.
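
The workaround steps above could also be scripted. The following is a minimal Python sketch, assuming a GRUB Legacy menu.lst in which each boot entry's command line starts with the word kernel. The function name append_kernel_param is hypothetical, and the default parameter string is copied verbatim from the quoted note, not checked against kernel documentation.

```python
def append_kernel_param(menu_text: str, param: str = "nmi_watchdog=0s") -> str:
    """Return menu_text with `param` appended to every 'kernel' line lacking it.

    Hypothetical helper: it only transforms text and never touches the disk.
    """
    out = []
    for line in menu_text.splitlines():
        tokens = line.split()
        # Only lines whose first word is exactly "kernel" are boot command
        # lines; commented-out entries such as "#kernel ..." are left alone.
        if tokens and tokens[0] == "kernel" and param not in tokens:
            line = line.rstrip() + " " + param
        out.append(line)
    # Preserve the trailing newline if the input had one.
    return "\n".join(out) + ("\n" if menu_text.endswith("\n") else "")
```

The function is idempotent, so running it twice does not append the parameter twice. Reading /boot/grub/menu.lst, keeping a backup copy, and writing the result back as root are left to the caller.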

After appending "nmi_watchdog=0s" to the kernel lines in /boot/grub/menu.lst,

all kernels (kernel-largesmp-2.6.9-42.0.3.EL.x86_64 and my own kernel-smp-2.6.9-42.0.4.EL.x86_64 with CONFIG_NR_CPUS=16)

work great with all CPUs ..

Cheers & nice weekend

Martin

______________________________________________________
Martin Flemming
DESY / IT          office : Building 2b / 008a
Notkestr. 85       phone  : 040 - 8998 - 4667
22603 Hamburg      mail   : [EMAIL PROTECTED]
______________________________________________________



On Fri, 16 Mar 2007, Martin Flemming wrote:

Hi, Troy !

I will test kernel-largesmp-2.6.9-42.0.10.EL.x86_64 as soon as
possible, but unfortunately the machine is not really mine ...

One of our scientific groups owns it, and it's still in
production too .... :-)

I've contacted them about testing this new kernel and am still waiting
for an answer ...

I will report back to you once I've tested the kernel ..

Cheers & nice weekend

         martin

On Fri, 16 Mar 2007, Troy Dawson wrote:

Hi Martin and all,
I've just double checked with Sascha, the admin for the machine.
Remember, this is in production, so he can't do any tests.

It is currently running kernel-largesmp-2.6.9-42.0.10.EL.x86_64.
It sees all 16 CPUs (8 dual-core Opterons). (I have the output of
cpuinfo if you need it.)
Output of uname -a:
Linux <hostname deleted> 2.6.9-42.0.10.ELlargesmp #1 SMP Tue Feb 27 12:54:30 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

I have the output of grub if you want, but the important part looks normal:

title Scientific Linux SL (2.6.9-42.0.10.ELlargesmp)
        root (hd0,0)
        kernel /boot/vmlinuz-2.6.9-42.0.10.ELlargesmp ro root=LABEL=/ message=/boot/boot.msg console=tty0 console=ttyS0,9600N8 rhgb quiet
        initrd /boot/initrd-2.6.9-42.0.10.ELlargesmp.img

Maybe it's the original S.L. 4.4 x86_64 kernel (2.6.9-42.0.3) that is
having the problems.  Or maybe it's some setting in the BIOS.
Does the kernel crash go away when you update the kernel to
2.6.9-42.0.10.ELlargesmp?

Troy

Troy Dawson wrote:
Hi Martin,
I'm double checking right now, but it might be a day or two.  The
machine in question is in Germany, and is in production right now, so I
have to contact the system administrator to get the information.

I do know that for i386, I saw all 16 CPUs and had no problems at all
(with SL 4.4).
For x86_64 my data somehow got blanked.  You know how it is: you mean to
save a file and push the wrong keys, and you don't notice until your test
system is away in production.

I will get that information and update the page if need be.

Thanks
Troy

Martin Flemming wrote:
Hi, Stephen !

Yep, this was also my thought,
but this kernel "kernel-largesmp-2.6.9-42.0.3.EL.x86_64"
crashes, as I mentioned ...

Any other ideas ?

cheers,
             Martin

On Thu, 15 Mar 2007, Stephen J. Gowdy wrote:

It looks like you need largesmp (assuming you have the 8 dual-core
CPU version; most options appear to include only 4 dual-core CPUs):

"Please note that limits for <USV> v4 are for Update 3 or later.
Update 3 was released in March 2006. CPU counts over 8 (AMD64/EM64T)
or 64 (other architectures) require use of the largesmp kernel.
Certified limits reflect the current state of system testing by <USV>
and its partners, and the limit of support provided by a <USV> Linux
subscription."
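
The limit quoted above can be checked against what a running kernel actually reports. The following is a rough Python sketch, assuming the x86 /proc/cpuinfo layout in which each logical CPU has one "processor : N" line. The function names count_processors and needs_largesmp are hypothetical, and the default limit of 8 is the AMD64/EM64T figure from the quoted note.

```python
def count_processors(cpuinfo_text: str) -> int:
    """Count 'processor : N' entries, one per logical CPU the kernel sees."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key, sep, _value = line.partition(":")
        # Match only the exact key "processor", not e.g. "model name".
        if sep and key.strip() == "processor":
            count += 1
    return count


def needs_largesmp(cpu_count: int, limit: int = 8) -> bool:
    """True when the CPU count exceeds the limit quoted for the standard kernel."""
    return cpu_count > limit
```

On a live system the input would be the contents of /proc/cpuinfo. Note that a kernel limited to 8 CPUs reports only 8 entries, so the count shows what the kernel sees, not what the hardware has.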

On Thu, 15 Mar 2007, Martin Flemming wrote:

Hi, Troy et al.!

I noticed today
that on the hardware web page

https://www.scientificlinux.org/documentation/hardware/

you've published the successful
installation of a "Sun Fire x4600" machine ...

We've got the same machine in our lab, but unfortunately
we see only 8 CPUs, not 16 ...

So my question is: which kernel do you have installed?

I've installed the following one:

kernel-smp-2.6.9-42.0.3.EL.x86_64

which displays only 8 CPUs ...

At first, I tried the largesmp kernel

kernel-largesmp-2.6.9-42.0.3.EL.x86_64

but this one generates a kernel panic ....


Cheers,

      Martin



--
__________________________________________________
Troy Dawson  [EMAIL PROTECTED]  (630)840-6468
Fermilab  ComputingDivision/LCSI/CSI DSS Group
__________________________________________________
