If you are going to reimage the nodes to install software, make sure
you set up the modprobe override in the image first.  Fixing 22 nodes
by hand once isn't too bad, but the second or third time it gets old.
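
Before overwriting anything, it can help to see how the modprobe.conf
in the image differs from one that is known to boot.  The Python
sketch below is only an illustration, not something OSCAR provides.
It assumes the interesting lines are the "alias scsi_hostadapter..."
entries (this thread does not say which lines actually differ), and
the module names in the sample text are placeholders.

```python
def scsi_aliases(modprobe_conf_text):
    """Return {alias_name: module} for scsi_hostadapter aliases."""
    aliases = {}
    for line in modprobe_conf_text.splitlines():
        # Drop trailing comments, then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if (len(fields) == 3 and fields[0] == "alias"
                and fields[1].startswith("scsi_hostadapter")):
            aliases[fields[1]] = fields[2]
    return aliases


def alias_differences(working_text, image_text):
    """Map each alias that differs to (module_in_image, module_in_working)."""
    working = scsi_aliases(working_text)
    image = scsi_aliases(image_text)
    return {
        name: (image.get(name), module)
        for name, module in working.items()
        if image.get(name) != module
    }


if __name__ == "__main__":
    # Placeholder contents; in practice read the two modprobe.conf files.
    working = "alias eth0 e1000\nalias scsi_hostadapter ata_piix\n"
    image = "alias eth0 e1000\nalias scsi_hostadapter mptscsih\n"
    print(alias_differences(working, image))
```

Anything the function reports is a candidate for the override; the
file in the image would normally be etc/modprobe.conf under the image
directory, wherever your install keeps it.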

I recorded some notes from when I installed the OFED package (some
InfiniBand drivers) here
(http://svn.oscar.openclustergroup.org/trac/oscar/wiki/TipOFED).  What
I don't mention there, but may add, is that I first found out which
dependencies were needed by installing the software I wanted on a
freshly imaged node.
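
One possible way to capture those dependencies (an assumption, not
necessarily how the notes above were produced) is to save the output
of "rpm -qa" on the fresh node before and after the install and
compare the two lists.  A minimal Python sketch, with made-up package
names:

```python
def new_packages(before_listing, after_listing):
    """Return packages present after the install but not before."""
    before = set(before_listing.split())
    after = set(after_listing.split())
    return sorted(after - before)


if __name__ == "__main__":
    # Hypothetical "rpm -qa" output; real lists would come from the node.
    before = "glibc-2.3.4-2.41\nbash-3.0-21.el4\n"
    after = before + "examplelib-1.0-1\nexample-app-2.0-1\n"
    for pkg in new_packages(before, after):
        print(pkg)
```

The resulting list is what would also have to go into the image.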

Basically, what I would do is:
a) Install the software on one node (take lots of notes on problems).
b) Install the software in a chroot on the OSCAR image (odd things will
happen, chroot is strange).
c) Try reimaging a node, and check that the software is right.
d) Reimage the cluster.
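
Step b) can be sketched roughly as follows.  This is a dry run that
only prints the commands, so nothing is changed until you have read
them.  The image directory is an assumption (the usual SystemImager
location, with the image name "oscarimage" taken from the log further
down), the RPM path is a placeholder, and software that ships its own
installer instead of an RPM would need a different command inside the
chroot.

```python
import shlex
from pathlib import PurePosixPath

# Assumption: usual SystemImager image location; adjust to your install.
IMAGE_DIR = PurePosixPath("/var/lib/systemimager/images/oscarimage")


def chroot_install_commands(image_dir, rpm_files):
    """Return shell commands that copy RPMs into the image and install them there."""
    staging = image_dir / "tmp"
    commands = []
    for rpm in rpm_files:
        name = PurePosixPath(rpm).name
        # Copy the package inside the image so the chroot can see it.
        commands.append(
            f"cp {shlex.quote(str(rpm))} {shlex.quote(str(staging / name))}"
        )
        # Install it with the image as the root filesystem.
        commands.append(
            f"chroot {shlex.quote(str(image_dir))} rpm -Uvh {shlex.quote('/tmp/' + name)}"
        )
    return commands


if __name__ == "__main__":
    # Placeholder package path, for illustration only.
    for command in chroot_install_commands(IMAGE_DIR, ["/root/example-package.rpm"]):
        print(command)
```

Printing first and running by hand fits the "take lots of notes"
approach: the printed commands double as the notes.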

This seems like a lot of bother for a small cluster, but like I said
before, what you are saving is the effort of doing it all again the
third time down the road, not time right now.

On Mon, Nov 3, 2008 at 5:38 PM, Ali Nazemian <[EMAIL PROTECTED]> wrote:
> Since we have just 22 nodes, I think the rescue disk would be reasonable.
> I'll do that.
> Also, I want to install some software on the nodes, such as Fluent, Ansys,
> etc. How can I put these packages in the image file instead of installing
> them on the nodes one by one?
> Cheers.
>
> On Tue, Nov 4, 2008 at 1:05 AM, Michael Edwards <[EMAIL PROTECTED]> wrote:
>>
>> hd0 is just a label for one of the hard drives, that is fairly normal.
>>  For some reason the nodes aren't liking the drives that OSCAR picked
>> for your nodes to use.
>>
>> What you'll need to do is boot a rescue CD on a node and copy the
>> modprobe.conf file it uses (assuming it boots properly and can mount
>> the node's disk) over the one made by OSCAR.
>>
>> Then use the tips here
>> (http://svn.oscar.openclustergroup.org/trac/oscar/wiki/TipLDAP) to
>> change the OSCAR files used when imaging the nodes for future nodes.
>>
>> Then depending on if you have a lot of nodes, you can reimage the
>> nodes or just repeat the rescue CD booting.  Fix the modprobe in the
>> images though, in case you reimage them later.
>>
>> Generally, the way I work is to make changes to a node, note which
>> changes worked, update the image with those changes, then reimage
>> the cluster if possible.  This depends on how many users you have,
>> of course :)
>>
>> On Sun, Nov 2, 2008 at 4:08 AM, Ali Nazemian <[EMAIL PROTECTED]>
>> wrote:
>> > Sorry, I forgot to write something in my last post, so please add
>> > this to it:
>> > On the client node boot menu I have 2 choices:
>> > 2.6.9-78.ELsmp_(hd0,0)
>> > 2.6.9-78.EL_(hd0,0)
>> > What does hd0 mean? I haven't seen anything like this in a Linux boot
>> > menu. Anyway, I tried to edit the "2.6.9-78.ELsmp_(hd0,0)" entry, and a
>> > new page with 3 lines showed up:
>> > root(hd0,0)
>> > kernel /vmlinuz-2.6.9-78.ELsmp ro root=/dev/sda6
>> > initrd /sc-initrd-2.6.9-78.ELsmp.gz
>> > I checked the server node, and for the server I have a boot menu like
>> > this (edit command):
>> > root(hd0,0)
>> > kernel /vmlinuz-2.6.9-78.ELsmp ro root=/dev/VolGroup00/logvoI00 rhgb...
>> > initrd /initrd-2.6.9-78.Elsmp.img
>> > Is this usual, or am I in trouble?
>> > FYI: I used the UYOK method.
>> > Cheers.
>> >
>> > On Sun, Nov 2, 2008 at 12:15 PM, Ali Nazemian <[EMAIL PROTECTED]>
>> > wrote:
>> >>
>> >> Hi again,
>> >> I set the image post-install action to "beep" and started the
>> >> installation process. After a while the installation on the client
>> >> node seemed to finish and it was waiting for a reboot, so I reset the
>> >> client node and changed the boot priority to hard disk first. The
>> >> client node then tried to boot CentOS from the hard disk, but these
>> >> errors showed up:
>> >>
>> >> Mounted /proc filesystem
>> >> Mounting sysfs
>> >> Creating /dev
>> >> Starting udev
>> >> Loading jbd.ko module
>> >> Loading ext3.ko module
>> >> Creating root device
>> >> Mounting root filesystem
>> >> mount:error 6 mounting ext3
>> >> mount:error 2 mounting none
>> >> Switching to new root
>> >> switchroot:mount failed:22
>> >> unmount /initrd/dev failed:2
>> >> kernel panic - not syncing : Attempted to kill init!
>> >> -------------------
>> >> On the server node, in the monitoring page, the installation of this
>> >> node is still green and its progress is "beeping", so it seems it was
>> >> successful until now.
>> >> What is the problem? What should I do now?
>> >> Cheers.
>> >>
>> >> On Sat, Nov 1, 2008 at 11:41 PM, Ali Nazemian <[EMAIL PROTECTED]>
>> >> wrote:
>> >>>
>> >>> I checked what you said, and the firewall was enabled. Although ssh
>> >>> was allowed, disabling the firewall solved this problem. After that,
>> >>> some new errors showed up on the client node about a partitioning
>> >>> problem, which I think is related to the ide.disk/scsi.disk file. So I
>> >>> have some questions about the imaging process in the OSCAR
>> >>> installation that could help me install it without errors:
>> >>> 1- In the "build the image" step we should choose a disk partition
>> >>> file. The instructions say to choose scsi.disk for SCSI disks and
>> >>> ide.disk for IDE disks, but what about SATA disks? As you know, hda is
>> >>> the device name for IDE and sda is for SCSI. I saw that SATA uses sda
>> >>> too, so should I choose scsi.disk?
>> >>> 2- In this step we should also choose the IP assignment method. The
>> >>> default value is static. Which one should I choose, static or dhcp?
>> >>> Which one is more efficient? I think static is better, what do you
>> >>> think?
>> >>> 3- Should the post-install action be reboot, beep, or something else?
>> >>> The installation manual says we shouldn't choose reboot if we want to
>> >>> use network boot installation, and I don't know which one is better
>> >>> and less error-prone for me.
>> >>> 4- When I tried to find the MAC address of the client node (I have
>> >>> just 2 nodes connected to the switch as a pilot project, one as the
>> >>> server and the other as a client), a wrong MAC address was found. I
>> >>> think it is the switch's MAC address, so I have to insert the client
>> >>> MAC address manually. Do you think this can cause errors in the
>> >>> installation process?
>> >>> Best regards.
>> >>>
>> >>> On Fri, Oct 31, 2008 at 1:57 AM, <[EMAIL PROTECTED]> wrote:
>> >>>>
>> >>>> Have you checked the headnode, to make sure that your firewall is not
>> >>>> running?
>> >>>>
>> >>>> For a GUI to turn off the firewall, run: system-config-securitylevel
>> >>>>
>> >>>> On Tue, Oct 28, 2008 at 2:51 PM, ali nazemian <[EMAIL PROTECTED]>
>> >>>> wrote:
>> >>>>>
>> >>>>> Hi.
>> >>>>> I want to set up a cluster for our HPC center using OSCAR, but I
>> >>>>> have a problem with step 7, installing the cluster.
>> >>>>> Here is my problem:
>> >>>>> When I run step 7, after some time a "tftp time out" error appears
>> >>>>> on the client node and the node terminates the boot agent, and
>> >>>>> "Received disconnect from 192.168.0.2: 2: The connection is closed
>> >>>>> by SSH Server Current FSM is SSH_Main_SSHProcess" appears on the
>> >>>>> server node.
>> >>>>> Here is the complete log of step 7:
>> >>>>>
>> >>>>>
>> >>>>> --------------------------------------------------------------------------
>> >>>>> --> Update Wizard Env (as needed)
>> >>>>> --> Step 7: Running: ./post_install
>> >>>>> Gathering processor count from oscarnode1.clusternet.
>> >>>>> ssh: connect to host oscarnode1.clusternet port 22: Connection timed
>> >>>>> out
>> >>>>> Improper count (0) returned from machine oscarnode1.clusternet at
>> >>>>> ./post_install line 83
>> >>>>>     main::get_numproc() called at ./post_install line 39
>> >>>>> ssh: connect to host oscarnode1 port 22: Connection timed out
>> >>>>> rsync: connection unexpectedly closed (0 bytes received so far)
>> >>>>> [sender]
>> >>>>> rsync error: error in rsync protocol data stream (code 12) at
>> >>>>> io.c(359)
>> >>>>> --> About to run /opt/oscar/packages/loghost/scripts/post_install
>> >>>>> for
>> >>>>> loghost
>> >>>>> ************************* oscar_cluster *************************
>> >>>>> --------- oscarnode1---------
>> >>>>> ssh: connect to host oscarnode1 port 22: Connection timed out
>> >>>>> --> About to run /opt/oscar/packages/ganglia/scripts/post_install
>> >>>>> for
>> >>>>> ganglia
>> >>>>> [ganglia] Ganglia gmond configuration file modified, re-starting
>> >>>>> daemon...
>> >>>>> Shutting down GANGLIA gmond: [  OK  ]
>> >>>>> Starting GANGLIA gmond: [  OK  ]
>> >>>>> editing /etc/gmetad.conf
>> >>>>> match: gridname\s+.*
>> >>>>> match: data_source\s+.*
>> >>>>> [ganglia] Ganglia gmetad configuration file modified, re-starting
>> >>>>> daemon...
>> >>>>> Shutting down GANGLIA gmetad: [  OK  ]
>> >>>>> Starting GANGLIA gmetad: [  OK  ]
>> >>>>> [ganglia] Starting up apache...
>> >>>>> Stopping httpd: [  OK  ]
>> >>>>> Starting httpd: [  OK  ]
>> >>>>> [ganglia] Ganglia page is located at
>> >>>>> http://server.clusternet/ganglia/
>> >>>>> ************************* oscar_cluster *************************
>> >>>>> --------- oscarnode1---------
>> >>>>> Received disconnect from 192.168.0.2: 2: The connection is closed by
>> >>>>> SSH Server
>> >>>>> Current FSM is SSH_Main_SSHProcess
>> >>>>> --> About to run /opt/oscar/packages/torque/scripts/post_install for
>> >>>>> torque
>> >>>>> Received disconnect from 192.168.0.2: 2: The connection is closed by
>> >>>>> SSH Server
>> >>>>> Current FSM is SSH_Main_SSHProcess
>> >>>>> rsync: connection unexpectedly closed (0 bytes received so far)
>> >>>>> [sender]
>> >>>>> rsync error: error in rsync protocol data stream (code 12) at
>> >>>>> io.c(359)
>> >>>>> TORQUE mom config file updated with clienthost: server.clusternet
>> >>>>> Pushing config file to clients...
>> >>>>> Sending SIGHUP to all moms...
>> >>>>> ************************* oscar_cluster *************************
>> >>>>> --------- oscarnode1---------
>> >>>>> Received disconnect from 192.168.0.2: 2: The connection is closed by
>> >>>>> SSH Server
>> >>>>> Current FSM is SSH_Main_SSHProcess
>> >>>>> [torque] Updating pbs_server nodes
>> >>>>> /opt/pbs/bin/pbsnodes: Server has no node list
>> >>>>> Shutting down TORQUE Server: [  OK  ]
>> >>>>> Starting TORQUE Server: [  OK  ]
>> >>>>> [torque] Creating TORQUE workq queue...
>> >>>>> Max open servers: 4
>> >>>>> set queue workq resources_max.ncpus = 0
>> >>>>> set queue workq resources_max.nodect = 0
>> >>>>> set queue workq resources_available.nodect = 0
>> >>>>> set server resources_available.ncpus = 0
>> >>>>> set server resources_available.nodect = 0
>> >>>>> set server resources_available.nodes = 0
>> >>>>> set server resources_max.ncpus = 0
>> >>>>> set server resources_max.nodes = 0
>> >>>>> set server scheduler_iteration = 60
>> >>>>> set server log_events = 64
>> >>>>> Shutting down MAUI Scheduler: [  OK  ]
>> >>>>> Starting MAUI Scheduler: [  OK  ]
>> >>>>> --> About to run /opt/oscar/packages/switcher/scripts/post_install
>> >>>>> for
>> >>>>> switcher
>> >>>>> Setting default for tag mpi ("lam-7.1.2")
>> >>>>> Attribute successfully set; new attribute setting will be effective
>> >>>>> for
>> >>>>> future shells
>> >>>>> Received disconnect from 192.168.0.2: 2: The connection is closed by
>> >>>>> SSH Server
>> >>>>> Current FSM is SSH_Main_SSHProcess
>> >>>>> rsync: connection unexpectedly closed (0 bytes received so far)
>> >>>>> [sender]
>> >>>>> rsync error: error in rsync protocol data stream (code 12) at
>> >>>>> io.c(359)
>> >>>>> --> About to run /opt/oscar/packages/mta-config/scripts/post_install
>> >>>>> for mta-config
>> >>>>> ************************************ WARNING
>> >>>>> ************************************
>> >>>>> OSCAR could not set up the configuration for any mailing service on
>> >>>>> the
>> >>>>> server.
>> >>>>> The current version of the mta-config package in OSCAR only supports
>> >>>>> the Postfix mail transfer agent (MTA).
>> >>>>> It looks like you have another MTA installed (e.g, sendmail or
>> >>>>> exim);
>> >>>>> as such,
>> >>>>> please be aware that OSCAR will not automatically configure it.
>> >>>>> ************************************ WARNING
>> >>>>> ************************************
>> >>>>> --> About to run /opt/oscar/packages/ntpconfig/scripts/post_install
>> >>>>> for
>> >>>>> ntpconfig
>> >>>>> Shutting down ntpd: [  OK  ]
>> >>>>> Starting ntpd: [  OK  ]
>> >>>>> ************************* oscar_cluster *************************
>> >>>>> --------- oscarnode1---------
>> >>>>> Received disconnect from 192.168.0.2: 2: The connection is closed by
>> >>>>> SSH Server
>> >>>>> Current FSM is SSH_Main_SSHProcess
>> >>>>> --> About to run /opt/oscar/packages/opium/scripts/post_install for
>> >>>>> opium
>> >>>>> Not all hosts were accessible by c3! Will retry the update later
>> >>>>> Could not find template for file switcher.ini
>> >>>>> If this contains distro-specific lines, please create a template!
>> >>>>> image:
>> >>>>> $VAR1 = 'oscarimage';
>> >>>>> ---------------
>> >>>>> Received disconnect from 192.168.0.2: 2: The connection is closed by
>> >>>>> SSH Server
>> >>>>> Current FSM is SSH_Main_SSHProcess
>> >>>>> rsync: connection unexpectedly closed (0 bytes received so far)
>> >>>>> [sender]
>> >>>>> rsync error: error in rsync protocol data stream (code 12) at
>> >>>>> io.c(359)
>> >>>>> Could not find template for file gshadow
>> >>>>> If this contains distro-specific lines, please create a template!
>> >>>>> image:
>> >>>>> $VAR1 = 'oscarimage';
>> >>>>> ---------------
>> >>>>> Received disconnect from 192.168.0.2: 2: The connection is closed by
>> >>>>> SSH Server
>> >>>>> Current FSM is SSH_Main_SSHProcess
>> >>>>> rsync: connection unexpectedly closed (0 bytes received so far)
>> >>>>> [sender]
>> >>>>> rsync error: error in rsync protocol data stream (code 12) at
>> >>>>> io.c(359)
>> >>>>> image:
>> >>>>> $VAR1 = 'oscarimage';
>> >>>>> ---------------
>> >>>>> Received disconnect from 192.168.0.2: 2: The connection is closed by
>> >>>>> SSH Server
>> >>>>> Current FSM is SSH_Main_SSHProcess
>> >>>>> rsync: connection unexpectedly closed (0 bytes received so far)
>> >>>>> [sender]
>> >>>>> rsync error: error in rsync protocol data stream (code 12) at
>> >>>>> io.c(359)
>> >>>>> image:
>> >>>>> $VAR1 = 'oscarimage';
>> >>>>> ---------------
>> >>>>> Received disconnect from 192.168.0.2: 2: The connection is closed by
>> >>>>> SSH Server
>> >>>>> Current FSM is SSH_Main_SSHProcess
>> >>>>> rsync: connection unexpectedly closed (0 bytes received so far)
>> >>>>> [sender]
>> >>>>> rsync error: error in rsync protocol data stream (code 12) at
>> >>>>> io.c(359)
>> >>>>> image:
>> >>>>> $VAR1 = 'oscarimage';
>> >>>>> ---------------
>> >>>>> Received disconnect from 192.168.0.2: 2: The connection is closed by
>> >>>>> SSH Server
>> >>>>> Current FSM is SSH_Main_SSHProcess
>> >>>>> rsync: connection unexpectedly closed (0 bytes received so far)
>> >>>>> [sender]
>> >>>>> rsync error: error in rsync protocol data stream (code 12) at
>> >>>>> io.c(359)
>> >>>>> --> About to run /opt/oscar/packages/oda/scripts/post_install for
>> >>>>> oda
>> >>>>> generating the /etc/odaserver file on all oscar clients
>> >>>>> . /etc/profile.d/c3.sh && cexec 'echo oscar_server > /etc/odaserver'
>> >>>>> ************************* oscar_cluster *************************
>> >>>>> --------- oscarnode1---------
>> >>>>> Received disconnect from 192.168.0.2: 2: The connection is closed by
>> >>>>> SSH Server
>> >>>>> Current FSM is SSH_Main_SSHProcess
>> >>>>> Cluster setup complete!
>> >>>>> --> Step 7: Successfully completed the cluster install
>> >>>>> --> Update Wizard Env (as needed)
>> >>>>>
>> >>>>>
>> >>>>> -----------------------------------------------------------------------------------
>> >>>>> P.S.: I am using OSCAR 5 on CentOS 4.7 x86_64. I can't use CentOS 5.x
>> >>>>> because it has a problem with my graphics cards.
>> >>>>> Best regards.
>> >>>>>
>> >>>>> --
>> >>>>> A.Nazemian
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> -------------------------------------------------------------------------
>> >>>>> This SF.Net email is sponsored by the Moblin Your Move Developer's
>> >>>>> challenge
>> >>>>> Build the coolest Linux based applications with Moblin SDK & win
>> >>>>> great
>> >>>>> prizes
>> >>>>> Grand prize is a trip for two to an Open Source event anywhere in
>> >>>>> the
>> >>>>> world
>> >>>>> http://moblin-contest.org/redirect.php?banner_id=100&url=/
>> >>>>> _______________________________________________
>> >>>>> Oscar-users mailing list
>> >>>>> Oscar-users@lists.sourceforge.net
>> >>>>> https://lists.sourceforge.net/lists/listinfo/oscar-users
>> >>>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> A.Nazemian
>> >>
>> >>
>> >>
>> >> --
>> >> A.Nazemian
>> >
>> >
>> >
>> > --
>> > A.Nazemian
>> >
>> >
>> >
>> >
>>
>
>
>
> --
> A.Nazemian
>
>
>
