YES!  That solved the problem.  I think I did run update_client.  This
is a NASTY bug!
To fix my nodes, I booted into each one with a rescue disk and ran

chroot /mnt/sysimage
systemconfigurator --configsi

It gave me an error message about the RAM disk.  I then rebooted, and
everything is running fine!  Thanks to all of you for the great support.
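For anyone who finds this thread later with the same symptom: the letters
LILO prints at boot (see the troubleshooting excerpt Greg quoted further
down) tell you how far the boot got.  Below is a rough sketch of that table
as a POSIX shell function.  The name lilo_stage is made up for illustration
and is not part of LILO, OSCAR, or SystemImager; the descriptions are
condensed from the excerpt.

```sh
#!/bin/sh
# Hypothetical helper (not part of LILO, OSCAR, or SystemImager):
# print what the letters LILO shows at boot say about how far it got.
# Descriptions are condensed from the LILO troubleshooting excerpt.
lilo_stage() {
    case "$1" in
        LILO)    echo "All parts of LILO loaded successfully" ;;
        LIL-)    echo "Descriptor table corrupt: geometry mismatch, or /boot/map moved without rerunning the map installer" ;;
        "LIL?")  echo "Second stage loaded at an incorrect address: geometry mismatch, or /boot/boot.b moved" ;;
        LIL)     echo "Second stage started but cannot load the descriptor table from the map file" ;;
        LI)      echo "Second stage loaded but failed to execute: geometry mismatch, or /boot/boot.b moved" ;;
        # "L" may be followed by a two-digit disk error code, e.g. "L 01"
        L|"L "*) echo "First stage cannot load the second stage: media failure or geometry mismatch" ;;
        "")      echo "No part of LILO loaded: not installed, or boot partition not active" ;;
        *)       echo "Unrecognized LILO prompt: $1" ;;
    esac
}
```

Running `lilo_stage LIL-` prints the descriptor-table line, which is
consistent with the missing /boot/map I found in rescue mode.  "LIL?" is
quoted in the case pattern so that `?` is matched literally instead of as a
glob.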

        |-----Original Message-----
        |From: Michael Chase-Salerno [mailto:[EMAIL PROTECTED] 
        |Sent: Wednesday, March 19, 2003 4:45 PM
        |To: Chris Oubre
        |Cc: Oscar-users List
        |Subject: Re: [Oscar-users] RE: OSCAR cluster dead after simple changes
        |
        |
        |Did you happen to run update_client on the nodes at
        |some point? That would have wiped out some of the
        |configuration done by systemconfigurator, which is
        |what creates those files. The networking files are
        |probably missing or incorrect as well. Running
        |update_client is currently not supported on an OSCAR
        |cluster for this reason.
        |
        |If that's the case, you should be able to run 
        |systemconfigurator again on the nodes from a booted 
        |rescue disk. You'll have to mount all the filesystems 
        |and run it chrooted. You can steal the proper command 
        |from the autoinstall script for the image in 
        |/var/lib/systemimager/scripts.
        |
        |Mike
        |
        |On Wed, 2003-03-19 at 17:31, Chris Oubre wrote:
        |> OK, I booted a node into rescue mode from a Red Hat 7.2
        |> installation CD and noticed that certain files are missing.
        |> 
        |> There is no
        |> /etc/lilo.conf
        |> /boot/map
        |> 
        |> and no ramdisk.
        |> 
        |> Where does OSCAR keep these files? They are also not in the
        |> /var/lib/systemimager/images/oscarimage/ tree on the master node.
        |> 
        |> This seems odd, because I get the LIL- error message, which
        |> suggests to me that it is loading a lilo.conf file, but I cannot
        |> find that file anywhere on the nodes.
        |> 
        |>         |-----Original Message-----
        |>         |From: Keller, Gregory W xNON-EMPLOYEEx 
        |>         |[mailto:[EMAIL PROTECTED] 
        |>         |Sent: Wednesday, March 19, 2003 12:50 PM
        |>         |To: '[EMAIL PROTECTED]'
        |>         |Cc: '[EMAIL PROTECTED]'
        |>         |Subject: Re: OSCAR cluster dead after simple changes
        |>         |
        |>         |
        |>         |Chris,
        |>         |The change to the switch is irrelevant. Until LILO
        |>         |bootstraps the kernel, your network card isn't a
        |>         |factor, so unless you're diskless or using PXE to boot
        |>         |over the network, this is an install problem on the
        |>         |local disks.
        |>         |
        |>         |Check out this link:
        |>         |http://user.fundy.net/cyclist/linux/troubleshoot-LILO.html
        |>         |
        |>         |Here is an excerpt that will point us in the right direction:
        |>         |  (nothing) No part of LILO has been loaded. LILO either isn't
        |>         |            installed or the partition on which its boot sector
        |>         |            is located isn't active.
        |>         |  L ...     The first stage boot loader has been loaded and
        |>         |            started, but it can't load the second stage boot
        |>         |            loader. The two-digit error codes indicate the type
        |>         |            of problem. (See also section "Disk error codes".)
        |>         |            This condition usually indicates a media failure or
        |>         |            a geometry mismatch (e.g. bad disk parameters, see
        |>         |            section "Disk geometry").
        |>         |  LI        The first stage boot loader was able to load the
        |>         |            second stage boot loader, but has failed to execute
        |>         |            it. This can either be caused by a geometry mismatch
        |>         |            or by moving /boot/boot.b without running the map
        |>         |            installer.
        |>         |  LIL       The second stage boot loader has been started, but
        |>         |            it can't load the descriptor table from the map
        |>         |            file. This is typically caused by a media failure or
        |>         |            by a geometry mismatch.
        |>         |  LIL?      The second stage boot loader has been loaded at an
        |>         |            incorrect address. This is typically caused by a
        |>         |            subtle geometry mismatch or by moving /boot/boot.b
        |>         |            without running the map installer.
        |>         |  LIL-      The descriptor table is corrupt. This can either be
        |>         |            caused by a geometry mismatch or by moving /boot/map
        |>         |            without running the map installer.
        |>         |  LILO      All parts of LILO have been successfully loaded.
        |>         |
        |>         |Let us know what you find; perhaps the installer is
        |>         |having trouble with the map installer.
        |>         |
        |>         |Keep smiling,
        |>         |Greg
        |>         |
        |>         |
        |>         |    Message: 2
        |>         |    From: "Chris Oubre" <[EMAIL PROTECTED]>
        |>         |    To: <[EMAIL PROTECTED]>
        |>         |    Date: Tue, 18 Mar 2003 17:24:18 -0600
        |>         |    Subject: [Oscar-users] OSCAR cluster dead after simple changes
        |>         |
        |>         |    We are currently working on a nasty problem with our
        |>         |    OSCAR cluster.
        |>         |
        |>         |    I am running OSCAR 2.1 on Red Hat 7.2 with a kernel
        |>         |    modified to accommodate my e1000 gigabit cards and
        |>         |    Mylex hardware RAID.
        |>         |
        |>         |    First, the problem.
        |>         |    We recently received some new nodes for our Beowulf
        |>         |    cluster, so we brought the cluster down to install them.
        |>         |    We installed an additional network card in the switch
        |>         |    and manually moved the old nodes to new positions in
        |>         |    the rack. When we rebooted the cluster (master and old
        |>         |    nodes), NONE of the old nodes booted. They stop at a
        |>         |    screen that says
        |>         |
        |>         |
        |>         |    LIL-
        |>         |
        |>         |    I have removed the additional card from the switch and
        |>         |    tried rebooting everything. No joy.
        |>         |    I am able to ping the master node from the switch and
        |>         |    vice versa.
        |>         |
        |>         |    Could you give me some ideas of where to look? Or do I
        |>         |    need to completely reinstall? I'd rather avoid a
        |>         |    reinstall, because I do not know what caused the crash
        |>         |    and would therefore be susceptible to a recurrence.
        |>         |
        |>         |
        |>         |
        |>         |****************************************************
        |>         |Christopher D. Oubre                               *
        |>         |email: [EMAIL PROTECTED]                     *
        |>         |research: http://cmt.rice.edu/~coubre              *
        |>         |Web: http://www.angelfire.com/la2/oubre            *
        |>         |Hangout: http://pub44.ezboard.com/bsouthterrebonne *
        |>         |Phone:(713)348-3541  Fax:   (713)348-4150          *
        |>         |Rice University                                    *
        |>         |Department of Physics, M.S. 61                     *
        |>         |6100 Main St.                       ^-^            *
        |>         |Houston, Tx  77251-1892, USA       (O O)           *
        |>         |-= Phlax=-                         ( v )           *
        |>         |************************************m*m*************
        |>         |
        |>         |
        |>         |
        |> 
        |> 
        |> _______________________________________________
        |> Oscar-users mailing list
        |> [EMAIL PROTECTED]
        |> https://lists.sourceforge.net/lists/listinfo/oscar-users
        |-- 
        |Michael Chase-Salerno           [EMAIL PROTECTED]
        |IBM Linux Systems Technology    Poughkeepsie, NY 
        |System Installation Suite       www.sisuite.org
        |
        |
        |

