Some very strange goings-on:

I did in fact try updating the files in /tftpboot, and that seemed to make the difference: for a brief time (long enough to image 61/64 nodes) the tg3 error went away and I was able to image them. However, on the 64th node, suddenly (and, I believe, without any intervention from me) the boot error (tg3: problem fetching invariants) began again. (The other two unimaged nodes were 3 and 39; I just missed their netboots accidentally.)

I tried for several hours to reproduce the successful server setup, with Frank's tarball unpacked in /usr/share/systemimager/boot/i386/standard, copying the kernel and initrd.img to /tftpboot over and over again at various times, etc., but I have been unable to reproduce the halcyon days of 3:30 in the afternoon yesterday, when everything briefly worked.

The good news is that we know that the imaging CAN work on these nodes. The bad news is that I haven't had much luck reproducing it, and this is silly, because I've done it before!

Currently, my error (when trying to image the nodes) is the tg3 error again, and the tg3 driver is definitely v. 1.5 (1.2 also doesn't work).

I have been going around in circles now trying to reproduce something that I think should work to the point that I am frazzled.

To get this to work properly again (for imaging at a later date) maybe I should just reinstall the OS and everything from scratch. I think that all I should have to do is:

a) copy Frank's tarball to /usr/share/systemimager/boot/i386/standard
b) tar xvf it
c) copy /usr/share/systemimager/boot/i386/standard/kernel and /usr/share/systemimager/boot/i386/standard/initrd.img to /tftpboot
d) ensure that I have /var/lib/systemimager/overrides/IMAGENAME/etc/modules.conf consistent with the 1750s
e) choose "Network Setup" in OSCAR installer and "prepare network boot"


and go. (I've tried copying Frank's kernel and initrd.img both before and after choosing "prepare network boot" in the OSCAR installer.)
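
Steps (a) through (c) above could be scripted roughly as follows. This is only a sketch: the function name install_boot_files is made up for illustration, the tarball's real filename is not given in this thread, and it assumes the tarball unpacks kernel and initrd.img at its top level.

```shell
# Sketch of steps (a)-(c). install_boot_files is a hypothetical helper,
# not part of OSCAR or SystemImager.
# Usage: install_boot_files TARBALL BOOT_DIR TFTP_DIR
install_boot_files() {
    tarball=$1 boot_dir=$2 tftp_dir=$3

    # (a) copy the tarball into the SystemImager boot directory
    cp "$tarball" "$boot_dir/" || return 1

    # (b) unpack it in place (assumes kernel and initrd.img sit at the
    #     top level of the archive)
    (cd "$boot_dir" && tar xvf "$(basename "$tarball")") || return 1

    # (c) copy the kernel and initrd to the directory served over TFTP
    cp "$boot_dir/kernel" "$boot_dir/initrd.img" "$tftp_dir/" || return 1
}
```

With the paths from this message it would be called as `install_boot_files TARBALL /usr/share/systemimager/boot/i386/standard /tftpboot`, where TARBALL stands in for wherever Frank's tarball was saved. Steps (d) and (e) are left manual, since they depend on the image and on the OSCAR installer GUI.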

Incidentally, the head node is a Dell 2650 (all the other nodes are 1750s), and it's crashing occasionally running RH 9.0, which is not good--I'm going to have to fix that before we go to production. Anyone run into this problem?

Jason


On Sep 21, 2004, at 7:32 PM, Frank Crawford wrote:

Jason,
Can you get one other piece of info: what kernel does the workstation
think it is booting? Basically, the first line or two of the boot should
give the version number, etc.


You are right that your problems sound very similar, and are more likely
related to the wrong kernel booting.  I'm pretty sure that the tg3
driver was 1.4, not 1.2, so it does look like the wrong kernel.

Frank

On Wed, 2004-09-22 at 05:23, Jason Hlady wrote:
I just tried using the stock kernel files from OSCAR;

I got very similar results:

tg3.c: v1.2 (Nov 14, 2002)
tg3: Problem fetching invariants of chip, aborting
tg3: Problem fetching invariants of chip, aborting
sk98lin: no adapter found

<stuff>

VFS: Mounted root (cramfs filesystem)
Mounted devfs on /dev
Freeing unused kernel memory: 524k freed
kernel NULL pointer dereference at virtual address 00000000
kernel panic

Yes, it is possible that the NICs that are in these 1750s have been
changed from the previous version of 1750s, without them telling us,
and possibly the NICs no longer work with the drivers. Grrrr. The
fact that the errors we are getting are so similar to previous errors
that other people have seen (i.e. this tg3 invariants error), however,
makes me wonder if I'm just doing something wrong. Am I correct that
making a modification to /usr/share/systemimager/boot/i386/standard/*
(i.e. kernel, config, boel_binaries.tar.gz) and then restarting the "setup
networking" should be sufficient? That is, I don't actually need to
modify the files in /tftpboot because they will have been automatically
changed by those two steps?
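
One way to check this directly, rather than relying on the installer, is to compare the files that TFTP will actually serve against the ones in the SystemImager boot directory. The sketch below does that with cmp; the function name boot_files_match is hypothetical, and it only checks kernel and initrd.img, the two files this thread mentions copying to /tftpboot.

```shell
# Sketch: report whether the kernel and initrd.img in the TFTP directory
# are byte-for-byte identical to those in the boot directory.
# boot_files_match is a hypothetical helper, not an OSCAR command.
# Usage: boot_files_match BOOT_DIR TFTP_DIR
boot_files_match() {
    boot_dir=$1 tftp_dir=$2
    status=0
    for f in kernel initrd.img; do
        # cmp -s is silent and exits non-zero if the files differ
        # or if either one is missing
        if cmp -s "$boot_dir/$f" "$tftp_dir/$f"; then
            echo "$f: identical"
        else
            echo "$f: DIFFERS or missing"
            status=1
        fi
    done
    return $status
}
```

Run as `boot_files_match /usr/share/systemimager/boot/i386/standard /tftpboot` after restarting the network setup step; a "DIFFERS or missing" line would mean the clients are still being handed an older file.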


Jason



On Sep 21, 2004, at 12:01 PM, Bernard Li wrote:

Hi Jason:

I have used Frank Crawford's files with a wide variety of bcm57xx nics
and they all work fine... I guess it's possible that you are using nics
that don't work with the drivers?


Have you also tried using the stock kernel files from OSCAR to see if
that gives you different error messages?


Cheers,

Bernard

-----Original Message-----
From: Jason Hlady [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 21, 2004 10:33
To: Bernard Li
Cc: [EMAIL PROTECTED]; Jason Hlady; Frank Crawford
Subject: Re: [Oscar-users] RH9.0, Oscar3.0, Dell
1750PowerEdge, tg3 install woes


On Sep 21, 2004, at 11:10 AM, Bernard Li wrote:

Hey Jason:

Just a quick question - are the specs of the 2 sets of PowerEdge 1750
identical?  It seems kind of odd that it worked on the original
hardware but not the newer one, unless they have some subtle
changes...  Also, do they both have SCSI harddrives...?

No, I imagine that the hardware is not exactly the same, simply because it works on one and not the other. They both have SCSI hard drives (different size and speed, but that shouldn't matter), both using the onboard NICs.... As to the exact specs, that's something I haven't actually tracked down, but given that it's having trouble with the NICs (and possibly the SCSI system) I didn't check out component-by-component what has changed. I certainly can do that.

Jason



Cheers,

Bernard

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of
Jason Hlady
Sent: Tuesday, September 21, 2004 9:52
To: [EMAIL PROTECTED]
Cc: Frank Crawford; Jason Hlady
Subject: [Oscar-users] RH9.0, Oscar3.0, Dell 1750PowerEdge,
tg3 install woes

Hi all,

This is an especially frustrating help letter to have to
write. :)  I will explain why:

1) I have successfully used (December '03) OSCAR 3.0 on
Redhat 9.0 to install on 32 Dell PowerEdge 1750 servers.

2) Recently, another researcher has purchased a 64 node
cluster of Dell PowerEdge 1750 servers (the servers arrived
September '04) and I am setting this cluster up using RH9 and
OSCAR 3.0.  I am attempting to use the exact same
configuration as I used for MY cluster, which is happily
running OSCAR right now.

I pursued the standard install of OSCAR.  Because I've done
this once before on what should presumably be identical
hardware, I remembered to:

a) replace /usr/share/systemimager/boot/i386/standard/* with
the tarfile from Frank Crawford, who had given me the
boel_binaries.tar.gz, kernel, config, and initrd.img files
that will be used.  This EXACT set of files enabled me to do
the imaging on my cluster of PowerEdge 1750s in December of 2003.

b) create a
/var/lib/systemimager/overrides/IMAGENAME/etc/modules.conf
containing the EXACT same contents as that file on my previous
cluster so that the machines would remember to load the drivers.

However, when I start up network boot on the new server, and
netboot one of the new clients, it gets DHCP, receives the
correct DHCP address, and then begins to load the imaging
kernel.  However, I get the following errors (I had to write
them down, so these are just excerpts, albeit in
chronological order)

tg3: (02:00.0) phy probe failed, err -16
tg3: problem fetching invariants of chip, aborting
tg3: (02:00.1) phy probe failed, err -16
tg3: problem fetching invariants of chip, aborting

<stuff>

SCSI subsystem driver Revision: 1.00
kmod: failed to execv /sbin/modprobe -s -k scsi_hostadapter, errno = 2

< stuff>

FusionMPT base driver 2.03.00
mptbase: Initiating ioc0 bringup
mptbase: ioc0: WARNING: unexpected doorbell active
mptbase: ioc0: ERROR: doorbell ACK timeout (2)

<more stuff>

VFS: Mounted root (cramfs filesystem)
Mounted devfs on /dev
Freeing unused kernel memory: 524k freed
Unable to handle kernel NULL pointer dereference at virtual address <>
EIP: 0060:<c0264257>

< BUNCH OF NUMBERS>

Kernel panic: attempted to stop init!

And then it dies.

This is pretty annoying.  I had assumed that it would JUST
WORK given that the hardware, software, and operating system
(except for the head node, which is a 2650) is (nominally?)
identical in both cases.

I did a little searching on the net for this "tg3: problem
fetching invariants of chip, aborting" error, and turned
up this link,

http://www.mail-archive.com/[EMAIL PROTECTED]/msg00705.html

which has another source of these boel_binaries, etc. that
should ALSO work.  They do not work either for this new
cluster I am attempting to install: they get a similar tg3
error, and then fail.

What is going on here?  I've read that some people have been
happy with tg3 and some with bcm5700... I was perfectly happy
with tg3 until it stopped working on these *particular* Dell
1750s. :-(

And what about the kmod: failed to execv /sbin/modprobe -s -k scsi_hostadapter, errno = 2 error? Does that suggest that it hasn't correctly loaded the SCSI driver EITHER?

Does anyone have any suggestions?  What exactly is involved
(as much detail as possible would be appreciated) in trying
to make my very own set of boel_binaries/kernel/initrd.img?

Have I missed something really obvious?  Can anybody suggest
something?  Everyone was so helpful getting it to work correctly
the first time that I thought I'd take another crack at the list. :-)

Thanks a bunch,

Jason

--------------
Jason Hlady, B. Sc., M. Sc. (Chem), Adv. Cert. (Comp. Sci.)
Programmer/Analyst (Bioinformatics Specialist) U of
Saskatchewan, Bioinformatics Research Laboratory (BIRL)
[EMAIL PROTECTED] (306) 966-2075



-------------------------------------------------------
This SF.Net email is sponsored by: YOU BE THE JUDGE. Be one
of 170 Project Admins to receive an Apple iPod Mini FREE for
your judgement on who ports your project to Linux PPC the
best. Sponsored by IBM.
Deadline: Sept. 24. Go here: http://sf.net/ppc_contest.php
_______________________________________________
Oscar-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-users



--
ac3
Suite G16, Bay 7, Locomotive Workshop   Phone:  02 9209 4600
Australian Technology Park              Fax:    02 9209 4611
Eveleigh   NSW   1430




