Hello Bernard,

Yesterday I did something similar: I updated the RPMs on the headnode and (with cexec) on the nodes, and created a new image for the new nodes (we have to grow the cluster from 10 to 40 this week). Before building the new image I updated the RPMs in /tftpboot/rpm.
This worked fine for TORQUE, but the maui daemon stopped just after being started (as in Jerome's case). After that I also updated maui from the trunk (this time only the RPM on the headnode), and everything looks OK now, including the time use issue.

I am sorry to tell you that this should be a production cluster by the beginning of next week, because we are in a hurry now and have been badly delayed by the network card problem. I don't know if I will be able to find the time and place to check b6 on 2 of the nodes not yet added. Can we know what the fixes in this beta are?

Today I have to recheck our cluster again; I will report on that.

Thanks again,
Carlos

On 3/28/06, Bernard Li <[EMAIL PROTECTED]> wrote:
> Presuming this is not a production cluster...
>
> You should be able to upgrade the RPMs on top of your existing RPMs,
> however you need to do a few things:
>
> - update RPMs on headnode
> - update RPMs on image (rpm -ivhr /var/lib/systemimager/images/oscarimage <rpms>)
> - re-push image or use cexec to update RPMs on compute nodes
> - copy updated RPMs to /tftpboot/rpm
> - re-start the daemons if that is not automatically done
>
> See if that works for you.
>
> P.S. OSCAR 4.2.1b6 should be cut today, would appreciate it if you give
> that a whirl as well.
>
> Thanks,
>
> Bernard
>
> > -----Original Message-----
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED] On Behalf Of carlos vasco
> > Sent: Tuesday, March 28, 2006 2:09
> > To: Bernard Li
> > Cc: [email protected]
> > Subject: [Oscar-devel] Re: Oscar 4.2.1b5 testing
> >
> > Thanks, Bernard, but what should the way to update this rpms
> > (reinstall Oscar or can be done over that installation)?
> >
> > Carlos
> >
> > On 3/28/06, Bernard Li <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi Carlos:
> > >
> > > You can use the newer TORQUE RPMs from trunk:
> > >
> > > http://svn.oscar.openclustergroup.org/oscar/trunk/packages/torque/
> > >
> > > Please let us know if this fixes your problem.
> > >
> > > Cheers,
> > >
> > > Bernard
> > >
> > > ________________________________
> > >
> > > From: carlos vasco [mailto:[EMAIL PROTECTED]
> > > Sent: Tue 28/03/2006 01:55
> > > To: Bernard Li
> > > Cc: [email protected]
> > > Subject: Re: Oscar 4.2.1b5 testing
> > >
> > > Hi Bernard,
> > >
> > > I have been searching the torque list about the time use issue, and found
> > > that:
> > >
> > > >> We are using torque-1.2.0p5 and Maui-3.2.6p13. When I do a qstat I
> > > >> see that the 'Time Use' is only a couple of seconds, yet the jobs
> > > >> have been running for a couple of hours. We are running Matlab jobs
> > > >> which are launched from a script. They are only single cpu (no mpi).
> > > >
> > > > This is a bug fixed in 1.2.0p6...
> > >
> > > Since my problem is very similar (not a mpi issue), and oscar 4.2.1
> > > being torque-1.2.0p5 (I think), the solution could be using
> > > torque-1.2.0p6. Any easy way to update torque?
> > >
> > > Thanks,
> > > Carlos
> > >
> > > On 3/28/06, carlos vasco <[EMAIL PROTECTED]> wrote:
> > > > Hi Bernard (and oscar-devel, I forgot last time to cc them):
> > > >
> > > > The test now worked OK, apart from ganglia, but this has been
> > > > reconfigured by our IT people, so it should be ok.
> > > >
> > > > TORQUE still reports 00:00:00 ...
> > > >
> > > > Carlos
> > > >
> > > > On 3/28/06, carlos vasco <[EMAIL PROTECTED]> wrote:
> > > > > Hi Bernard (and oscar-devel, I forgot last time to cc them):
> > > > >
> > > > > The test now worked OK, apart from ganglia, but this has been
> > > > > reconfigured by our IT people, so it should be ok.
> > > > >
> > > > > TORQUE still reports 00:00:00 ...
> > > > >
> > > > > Carlos
> > > > >
> > > > > On 3/28/06, carlos vasco <[EMAIL PROTECTED]> wrote:
> > > > > > Not sure what I did, but apparently installing Oscar I modified
> > > > > > ssh_config on the server instead of sshd_config, and that is why I
> > > > > > think I forgot to modify sshd_config.
> > > > > >
> > > > > > I am trying the test again.
> > > > > >
> > > > > > Carlos
> > > > > >
> > > > > > On 3/28/06, Bernard Li <[EMAIL PROTECTED]> wrote:
> > > > > > >
> > > > > > > [ CC:ing oscar-devel on this ]
> > > > > > >
> > > > > > > You shouldn't need to manually modify your /etc/ssh/sshd_config to add the
> > > > > > > "PermitRootLogin" - this should already be done for you.
> > > > > > >
> > > > > > > In your error log, it indicates that you have put the option in
> > > > > > > /etc/ssh/ssh_config, which is _wrong_. Try taking out that line and re-run
> > > > > > > the tests (that option should be in sshd_config, not ssh_config, but as I
> > > > > > > mentioned you shouldn't need to manually modify it).
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Bernard
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: carlos vasco [mailto:[EMAIL PROTECTED]
> > > > > > > Sent: Mon 27/03/2006 22:18
> > > > > > > To: Bernard Li
> > > > > > > Subject: Re: [Oscar-devel] Oscar 4.2.1b5 testing
> > > > > > >
> > > > > > > Hi, Bernard,
> > > > > > >
> > > > > > > I don't know exactly what the logs are, I only can find the following
> > > > > > > in the /home/oscartst/ directory:
> > > > > > >
> > > > > > > drwxr-xr-x 2 oscartst oscartst 4096 Mar 27 15:30 ganglia
> > > > > > > drwxr-xr-x 2 oscartst oscartst 4096 Mar 27 15:31 lam
> > > > > > > drwxr-xr-x 2 oscartst oscartst 4096 Mar 22 16:49 maui
> > > > > > > drwxr-xr-x 2 oscartst oscartst 4096 Mar 27 15:30 mpich
> > > > > > > -rwxr-xr-x 1 oscartst oscartst 4826 Mar 27 15:30 pbs_test
> > > > > > > drwxr-xr-x 2 oscartst oscartst 4096 Mar 27 15:30 pvm
> > > > > > > -rwxr-xr-x 1 oscartst oscartst 927 Mar 27 15:30 ssh_user_tests
> > > > > > > -rwxr-xr-x 1 oscartst oscartst 7326 Mar 27 15:30 test_cluster
> > > > > > > -rwxr-xr-x 1 oscartst oscartst 3562 Mar 27 15:30 testprint
> > > > > > > drwxr-xr-x 2 oscartst oscartst 4096 Mar 27 15:30 torque
> > > > > > >
> > > > > > > In mpich,
> > > > > > > -rw-r--r-- 1 oscartst oscartst 3093 Mar 18 00:30 cpi.c
> > > > > > > -rw-r--r-- 1 oscartst oscartst 1732 Mar 18 00:30 cxxhello.cc
> > > > > > > -rw-r--r-- 1 oscartst oscartst 1647 Mar 18 00:30 f77hello.f
> > > > > > > -rwxrwxr-x 1 oscartst oscartst 337512 Mar 27 15:30 mpich-cpi
> > > > > > > -rw------- 1 oscartst oscartst 136 Mar 27 15:30 mpichtest.err
> > > > > > > -rw------- 1 oscartst oscartst 454 Mar 27 15:30 mpichtest.out
> > > > > > > -rwxr-xr-x 1 oscartst oscartst 1412 Mar 18 00:30 pbs_script.mpich
> > > > > > > -rw-rw-r-- 1 oscartst oscartst 510 Mar 27 13:51 PI21051
> > > > > > > -rw-rw-r-- 1 oscartst oscartst 510 Mar 27 12:37 PI3408
> > > > > > > -rwxr-xr-x 1 oscartst oscartst 2837 Mar 18 00:30 test_user
> > > > > > >
> > > > > > > I attach the mpichtest files.
> > > > > > >
> > > > > > > Not sure how to track the TORQUE problem, maybe I can config it in the
> > > > > > > same way we configured the other clusters.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Carlos
> > > > > > >
> > > > > > > On 3/27/06, Bernard Li <[EMAIL PROTECTED]> wrote:
> > > > > > > > Hi Carlos:
> > > > > > > >
> > > > > > > > > No problems have been found during installation, but some errors did
> > > > > > > > > occur during the test phase (see attachment).
> > > > > > > >
> > > > > > > > Can you post the relevant logs in /home/oscartst?
> > > > > > > >
> > > > > > > > > Other problem found is that qstat reports 00:00:00 in the
> > > > > > > > > Time Use field.
> > > > > > > >
> > > > > > > > I wonder if this is a TORQUE bug or a bug of us setting it up - do you
> > > > > > > > think you can dig deeper into this?
> > > > > > > >
> > > > > > > > > During installation, I think I forgot to put PermitRootLogin yes in
> > > > > > > > > sshd_config, and after the nodes were created, I cpushed the corrected
> > > > > > > > > sshd_config file. Could these be related with the errors?
> > > > > > > >
> > > > > > > > You shouldn't need to edit sshd_config manually - anyways, we should be
> > > > > > > > able to figure out what's wrong by investigating the log files.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Bernard
> >
> > -------------------------------------------------------
> > This SF.Net email is sponsored by xPML, a groundbreaking scripting language
> > that extends applications into web and mobile media.
> > Attend the live webcast and join the prime developer group breaking into this new coding territory!
> > http://sel.as-us.falkag.net/sel?cmd=k&kid0944&bid$1720&dat1642
> > _______________________________________________
> > Oscar-devel mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/oscar-devel
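
For anyone following the upgrade steps quoted in this thread (headnode, image, compute nodes, /tftpboot/rpm, daemons), here is a rough dry-run sketch in Python. It is not an OSCAR tool and nothing is executed; it only builds and prints the commands so they can be reviewed first. The two paths come from the thread. Everything else is an assumption for illustration: the function name `build_upgrade_plan`, the RPM file name, the idea that the compute nodes can read the same RPM files from a shared directory, and the `pbs_server` / `maui` init script names. The image step is written as `rpm -Uvh --root`, which is a substitution for the `rpm -ivhr` shorthand quoted in the thread.

```python
import posixpath
import shlex

IMAGE_DIR = "/var/lib/systemimager/images/oscarimage"  # path quoted in the thread
TFTP_RPM_DIR = "/tftpboot/rpm"                          # path quoted in the thread


def build_upgrade_plan(rpms, node_rpm_dir, services):
    """Return the upgrade steps as a list of (description, argv) pairs.

    Hypothetical helper: it only assembles commands, it never runs them.
    """
    rpms = list(rpms)
    if not rpms:
        raise ValueError("at least one RPM file is required")

    # Assumption: every compute node can read the same files under node_rpm_dir
    # (for example a shared filesystem); adjust to however the nodes get the RPMs.
    node_paths = [posixpath.join(node_rpm_dir, posixpath.basename(p)) for p in rpms]

    plan = [
        ("update RPMs on headnode", ["rpm", "-Uvh", *rpms]),
        ("update RPMs in image", ["rpm", "-Uvh", "--root", IMAGE_DIR, *rpms]),
        # cexec receives the remote command as a single string argument here.
        ("update RPMs on compute nodes",
         ["cexec", shlex.join(["rpm", "-Uvh", *node_paths])]),
        ("copy RPMs to " + TFTP_RPM_DIR, ["cp", *rpms, TFTP_RPM_DIR + "/"]),
    ]
    # Assumption: daemons are restarted through init scripts named in `services`.
    for name in services:
        plan.append(("restart " + name, ["service", name, "restart"]))
    return plan


if __name__ == "__main__":
    demo = build_upgrade_plan(
        ["torque-example.rpm"],            # hypothetical file name
        node_rpm_dir="/shared/rpms",       # hypothetical shared directory
        services=["pbs_server", "maui"],   # assumed init script names
    )
    for description, argv in demo:
        print("# " + description)
        print(shlex.join(argv))
```

Printing the plan before running anything by hand seems the safer choice on a cluster that is about to go into production; the re-push alternative mentioned in the thread is left out of the sketch on purpose.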
