I'm having this EXACT issue. I have 32 1750s running (I actually got Frank's tarball from Jason) and I'm now trying to add 4 more identical nodes. They do exactly what Jason says his did. Does anyone know how to resolve this?
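One thing that may be worth ruling out, given Frank's remark in the thread below that this looks like the wrong kernel booting: the kernel and initrd.img sitting in /tftpboot may not be the ones unpacked from the tarball. The following is only a rough sketch of how to compare them, not anything that ships with OSCAR or SystemImager. It assumes a Python 3 interpreter is available on the head node. The two directories and the file names are the ones Jason mentions in his steps; the function names are made up for illustration.

```python
import hashlib
from pathlib import Path

# Directories and file names as quoted in the thread (Jason's step c).
SOURCE_DIR = Path("/usr/share/systemimager/boot/i386/standard")
TFTP_DIR = Path("/tftpboot")
BOOT_FILES = ("kernel", "initrd.img")


def file_digest(path):
    """Return the SHA-256 hex digest of a file, or None if it is not a regular file."""
    path = Path(path)
    if not path.is_file():
        return None
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so a large initrd is not loaded into memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_boot_files(source_dir, tftp_dir, names=BOOT_FILES):
    """Map each file name to 'match', 'differs', or 'missing' (hypothetical helper)."""
    report = {}
    for name in names:
        src = file_digest(Path(source_dir) / name)
        dst = file_digest(Path(tftp_dir) / name)
        if src is None or dst is None:
            report[name] = "missing"
        elif src == dst:
            report[name] = "match"
        else:
            report[name] = "differs"
    return report


if __name__ == "__main__":
    for name, status in compare_boot_files(SOURCE_DIR, TFTP_DIR).items():
        print(f"{name}: {status}")
```

If either file reports "differs", the nodes are presumably netbooting something other than what was unpacked from the tarball, which would fit the tg3 version mismatch discussed below. A "match" on both only rules out this one cause.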
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Michael Edwards
Posted At: Wednesday, September 22, 2004 3:38 PM
Posted To: OSCAR
Conversation: [Oscar-users] RH9.0, Oscar3.0, Dell 1750PowerEdge, tg3 install woes
Subject: Re: [Oscar-users] RH9.0, Oscar3.0, Dell 1750PowerEdge, tg3 install woes

I had to change modules.conf when I was using a network card that the
OS did not initially detect correctly. Well, I did and it worked, anyway.

On Wed, 22 Sep 2004 11:51:35 -0700, Bernard Li <[EMAIL PROTECTED]> wrote:
> Hey Jason:
>
> I never had to muck around with the modules.conf file, perhaps you can
> omit this step?
>
> It probably is a good idea to start a clean slate. You only need to
> press the 'setup network boot' button once, and then you can copy the
> kernel boot files to /tftpboot and you should never need to touch them
> again.
>
> Hopefully it'll work this time with a clean install - BTW, you
> mentioned crashes with RH9, what kind of crashes (any error logs?) -
> Also, have you installed all the updates that are available from
> RedHat?
>
> Cheers,
>
> Bernard
>
> > -----Original Message-----
> > From: Jason Hlady [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, September 22, 2004 11:18
> > To: Frank Crawford
> > Cc: Bernard Li; oscar-users@lists.sourceforge.net; Jason Hlady
> > Subject: Re: [Oscar-users] RH9.0, Oscar3.0, Dell 1750PowerEdge, tg3
> > install woes
> >
> > Some very strange goings-on:
> >
> > I did in fact try updating the files in /tftpboot, and that seemed
> > to make the difference: for a brief time (enough to image 61/64
> > nodes) was able to image them! The tg3 error went away and I was
> > able to image them. However, on the 64th node, suddenly (I believe
> > without any intervention from me) the boot error (tg3: problem
> > fetching invariants) began again (the other two nodes were 3 and
> > 39--I just missed their netboots accidentally).
> >
> > I tried for several hours to reproduce the successful server setup,
> > with Frank's tarball unzipped in
> > /usr/share/systemimager/boot/i386/standard, and copying the kernel
> > and initrd.img to /tftpboot over and over again at various different
> > times, etc., but have been unable to reproduce the halcyon days of
> > 3:30 in the afternoon yesterday, where everything ephemerally
> > worked.
> >
> > The good news is that we know that the imaging CAN work on these
> > nodes. The bad news is that I haven't had much luck reproducing it,
> > and this is silly, because I've done it before!
> >
> > Currently, my error (when trying to image the nodes) is the tg3
> > error again, and it definitely is v. 1.5 (1.2 also doesn't work).
> >
> > I have been going around in circles now trying to reproduce
> > something that I think should work to the point that I am frazzled.
> >
> > To get this to work properly again (for imaging at a later
> > date) maybe I should just reinstall the OS and everything from
> > scratch. I think that all I should have to do is:
> >
> > a) copy Frank's tarball to
> >    /usr/share/systemimager/boot/i386/standard
> > b) tar xvf it
> > c) copy /usr/share/systemimager/boot/i386/standard/kernel and
> >    /usr/share/systemimager/boot/i386/standard/initrd.img to /tftpboot
> > d) ensure that I have
> >    /var/lib/system/systemimager/overrides/IMAGENAME/etc/modules.conf
> >    consistent with the 1750s
> > e) choose "Network Setup" in OSCAR installer and "prepare
> >    network boot"
> >
> > and go (I've tried copying Frank's kernel and initrd.img both before
> > and after choosing prepare network boot from the oscar installer.).
> >
> > Incidentally, the head node is a Dell 2650 (all the other nodes are
> > 1750s), and it's crashing occasionally running RH 9.0, which is not
> > good--I'm going to have to fix that before we go to production.
> > Anyone run into this problem?
> >
> > Jason
> >
> >
> > On Sep 21, 2004, at 7:32 PM, Frank Crawford wrote:
> >
> > > Jason,
> > >
> > > Can you get one other piece of info, what kernel does the workstation
> > > think it is booting. Basically the first line or two of the boot
> > > should give the version number, etc.
> > >
> > > You are right that your problems sound so similar, and more likely
> > > related to the wrong kernel booting. I'm pretty sure that the tg3
> > > driver was 1.4, not 1.2, so it does look like the wrong kernel.
> > >
> > > Frank
> > >
> > > On Wed, 2004-09-22 at 05:23, Jason Hlady wrote:
> > >> I just tried using the stock kernel files from OSCAR;
> > >>
> > >> I got very similar results:
> > >>
> > >> tg3.c: v1.2 (Nov 14, 2002)
> > >> tg3: Problem fetching invariants of chip, aborting
> > >> tg3: Problem fetching invariants of chip, aborting
> > >> sk98lin: no adapter found
> > >>
> > >> <stuff>
> > >>
> > >> VFS: Mounted root (cramfs filesystem)
> > >> Mounted devfs on /dev
> > >> Freeing unused kernel memory: 524k freed
> > >> kernel NULL pointer dereference at virtual address 00000000
> > >> kernel panic
> > >>
> > >> Yes, it is possible that the NICs that are in these 1750s have been
> > >> changed from the previous version of 1750s, without them telling us,
> > >> and possibly the NICs no longer work with the drivers. Grrrr. The
> > >> fact that the errors we are getting are so similar to previous errors
> > >> that other people have seen (i.e. this tg3 invariants error),
> > >> however, makes me wonder if I'm just doing something wrong. Am I
> > >> correct that making a modification to
> > >> /usr/share/systemimager/boot/i386/standard/*
> > >> (i.e. kernel, config, boel_binaries.gz) and then restarting the
> > >> "setup networking" should be sufficient? That is, I don't actually
> > >> need to modify the files in /tftpboot because they will have been
> > >> automatically changed by those two steps?
> > >>
> > >> Jason
> > >>
> > >>
> > >> On Sep 21, 2004, at 12:01 PM, Bernard Li wrote:
> > >>
> > >>> Hi Jason:
> > >>>
> > >>> I have used Frank Crawford's files with a wide variety of
> > >>> bcm57xx nics and they all work fine... I guess it's possible that
> > >>> you are using nics that don't work with the drivers?
> > >>>
> > >>> Have you also tried using the stock kernel files from OSCAR and see
> > >>> if that gives you different error messages?
> > >>>
> > >>> Cheers,
> > >>>
> > >>> Bernard
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: Jason Hlady [mailto:[EMAIL PROTECTED]
> > >>>> Sent: Tuesday, September 21, 2004 10:33
> > >>>> To: Bernard Li
> > >>>> Cc: oscar-users@lists.sourceforge.net; Jason Hlady; Frank Crawford
> > >>>> Subject: Re: [Oscar-users] RH9.0, Oscar3.0, Dell 1750PowerEdge, tg3
> > >>>> install woes
> > >>>>
> > >>>>
> > >>>> On Sep 21, 2004, at 11:10 AM, Bernard Li wrote:
> > >>>>
> > >>>>> Hey Jason:
> > >>>>>
> > >>>>> Just a quick question - are the specs of the 2 sets of PowerEdge
> > >>>>> 1750 identical? It seems kind of odd that it worked on the
> > >>>>> original hardware but not the newer one, unless they have some
> > >>>>> subtle changes... Also, do they both have SCSI harddrives...?
> > >>>>
> > >>>> No, I imagine that the hardware is not exactly the same, simply
> > >>>> because it works on one and not the other. They both have SCSI
> > >>>> hard drives (different size and speed, but that shouldn't matter),
> > >>>> both using the onboard NICs.... As to the exact specs, that's
> > >>>> something I haven't actually tracked down, but given that it's
> > >>>> having trouble with the NICs (and possibly the SCSI system) I
> > >>>> didn't check out component-by-component what has changed. I
> > >>>> certainly can do that.
> > >>>>
> > >>>> Jason
> > >>>>
> > >>>>
> > >>>>>
> > >>>>> Cheers,
> > >>>>>
> > >>>>> Bernard
> > >>>>>
> > >>>>>> -----Original Message-----
> > >>>>>> From: [EMAIL PROTECTED]
> > >>>>>> [mailto:[EMAIL PROTECTED] On Behalf Of
> > >>>>>> Jason Hlady
> > >>>>>> Sent: Tuesday, September 21, 2004 9:52
> > >>>>>> To: oscar-users@lists.sourceforge.net
> > >>>>>> Cc: Frank Crawford; Jason Hlady
> > >>>>>> Subject: [Oscar-users] RH9.0, Oscar3.0, Dell 1750PowerEdge,
> > >>>>>> tg3 install woes
> > >>>>>>
> > >>>>>> Hi all,
> > >>>>>>
> > >>>>>> This is an especially frustrating help letter to have to write.
> > >>>>>> :) I will explain why:
> > >>>>>>
> > >>>>>> 1) I have successfully used (December '03) OSCAR 3.0 on
> > >>>>>> Redhat 9.0 to install on 32 Dell PowerEdge 1750 servers.
> > >>>>>>
> > >>>>>> 2) Recently, another researcher has purchased a 64 node cluster
> > >>>>>> of Dell PowerEdge 1750 servers (the servers arrived September
> > >>>>>> '04) and I am setting this cluster up using RH9 and OSCAR 3.0. I
> > >>>>>> am attempting to use the exact same configuration as I used for
> > >>>>>> MY cluster, which is happily running OSCAR right now.
> > >>>>>>
> > >>>>>> I pursued the standard install of OSCAR. Because I've done this
> > >>>>>> once before on what should presumably be identical hardware,
> > >>>>>> I remembered to:
> > >>>>>>
> > >>>>>> a) replace /usr/share/systemimager/boot/i386/standard/* with the
> > >>>>>> tarfile from Frank Crawford, who had given me the
> > >>>>>> boel_binaries.tar.gz, kernel, config, and initrd.img files that
> > >>>>>> will be used. This EXACT set of files enabled me to do the
> > >>>>>> imaging on my cluster of PowerEdge 1750s in December of 2003.
> > >>>>>>
> > >>>>>> b) create an
> > >>>>>> /var/lib/systemimager/override/IMAGENAME/etc/modules.conf
> > >>>>>> containing the EXACT same file as that file on my previous
> > >>>>>> cluster so that the machines would remember to load the drivers.
> > >>>>>>
> > >>>>>> However, when I start up network boot on the new server, and
> > >>>>>> netboot one of the new clients, it gets DHCP, receives the
> > >>>>>> correct DHCP address, and then begins to load the imaging kernel.
> > >>>>>> However, I get the following errors (I had to write them down, so
> > >>>>>> these are just excerpts, albeit in chronological order)
> > >>>>>>
> > >>>>>> tg3: (02:00.0) phy probe failed, err -16
> > >>>>>> tg3: problem fetching invariants of chip, aborting
> > >>>>>> tg3: (02:00.1) phy probe failed, err -16
> > >>>>>> tg3: problem fetching invariants of chip, aborting
> > >>>>>>
> > >>>>>> <stuff>
> > >>>>>>
> > >>>>>> SCSI subsystem driver Revision: 1.00
> > >>>>>> kmod: failed to execv /sbin/modprobe -s -k scsi_hostadapter, errno = 2
> > >>>>>>
> > >>>>>> < stuff>
> > >>>>>>
> > >>>>>> FusionMPT base driver 2.03.00
> > >>>>>> mptbase: Initiating ioc0 bringup
> > >>>>>> mptbase: ioc0: WARNING: unexpected doorbell active
> > >>>>>> mptbase: ioc0: ERROR: doorbell ACK timeout (2)
> > >>>>>>
> > >>>>>> <more stuff>
> > >>>>>>
> > >>>>>> VFS: Mounted root (cramfs filesystem)
> > >>>>>> Mounted devfs on /dev
> > >>>>>> Freeing unused kernel memory: 524k freed
> > >>>>>> Unable to handle kernel NULL pointer dereference at virtual address <>
> > >>>>>> EIP: 0060:<c0264257>
> > >>>>>>
> > >>>>>> < BUNCH OF NUMBERS>
> > >>>>>>
> > >>>>>> Kernel panic: attempted to stop init!
> > >>>>>>
> > >>>>>> And then it dies.
> > >>>>>>
> > >>>>>> This is pretty annoying. I had assumed that it would JUST WORK
> > >>>>>> given that the hardware, software, and operating system (except
> > >>>>>> for the head node, which is a 2650) is (nominally?)
> > >>>>>> identical in both cases.
> > >>>>>>
> > >>>>>> I did a little searching on the net for this "tg3: problem
> > >>>>>> fetching invariants of chip, aborting" error, and turned up
> > >>>>>> this link,
> > >>>>>>
> > >>>>>> http://www.mail-archive.com/sisuite-users@lists.sourceforge.net/msg00705.html
> > >>>>>>
> > >>>>>> which has another source of these boel_binaries, etc that should
> > >>>>>> ALSO work. They do not work either for this new cluster I am
> > >>>>>> attempting to install: they get a similar tg3 error, and then
> > >>>>>> fail.
> > >>>>>>
> > >>>>>> What is going on here? I've read that some people have been
> > >>>>>> happy with tg3 and some with bc5700... I was perfectly happy
> > >>>>>> with tg3 until they don't seem to work for these *particular*
> > >>>>>> Dell 1750s. :-(
> > >>>>>>
> > >>>>>> And what about the kmod: failed to execv /sbin/modprobe -s -k
> > >>>>>> scsi_hostadapter, errno = 2 error? Does that suggest that it
> > >>>>>> hasn't correctly loaded the SCSI driver EITHER?
> > >>>>>>
> > >>>>>> Does anyone have any suggestions? What exactly is involved (as
> > >>>>>> much detail as possible would be appreciated) in trying to make
> > >>>>>> my very own set of boel_binaries/kernel/initrd.img?
> > >>>>>>
> > >>>>>> Have I missed something really obvious? Can anybody suggest
> > >>>>>> something?
> > >>>>>> Everyone was so helpful getting it to work correctly the first
> > >>>>>> time that I thought I'd take another crack at the list. :-)
> > >>>>>>
> > >>>>>> Thanks a bunch,
> > >>>>>>
> > >>>>>> Jason
> > >>>>>>
> > >>>>>> --------------
> > >>>>>> Jason Hlady, B. Sc., M. Sc. (Chem), Adv. Cert. (Comp. Sci.)
> > >>>>>> Programmer/Analyst (Bioinformatics Specialist)
> > >>>>>> U of Saskatchewan, Bioinformatics Research Laboratory (BIRL)
> > >>>>>> [EMAIL PROTECTED] (306) 966-2075
> > >>>>>>
> > >>>>>> -------------------------------------------------------
> > >>>>>> This SF.Net email is sponsored by: YOU BE THE JUDGE. Be one
> > >>>>>> of 170 Project Admins to receive an Apple iPod Mini FREE for
> > >>>>>> your judgement on who ports your project to Linux PPC the
> > >>>>>> best. Sponsored by IBM.
> > >>>>>> Deadline: Sept. 24. Go here: http://sf.net/ppc_contest.php
> > >>>>>> _______________________________________________
> > >>>>>> Oscar-users mailing list
> > >>>>>> Oscar-users@lists.sourceforge.net
> > >>>>>> https://lists.sourceforge.net/lists/listinfo/oscar-users
> > >
> > > --
> > > ac3
> > > Suite G16, Bay 7, Locomotive Workshop    Phone: 02 9209 4600
> > > Australian Technology Park               Fax: 02 9209 4611
> > > Eveleigh NSW 1430