From: carlos vasco [mailto:[EMAIL PROTECTED]]
Sent: Tue 28/03/2006 22:23
To: Bernard Li
Cc: [email protected]
Subject: Re: [Oscar-devel] Re: Oscar 4.2.1b5 testing
Hello, Bernard,

Yesterday I did something similar: I updated the RPMs on the headnode
and (with cexec) on the nodes, and created a new image for the new nodes
(we have to grow the cluster from 10 to 40 this week). Before building
the new image I updated the RPMs in /tftpboot/rpm.

This worked fine for TORQUE, but the maui daemon stopped just after
being started (as in Jerome's case). After that I also updated maui from
the trunk (this time only the RPM on the headnode), and everything looks
OK now, including the Time Use issue.

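For reference, the update sequence discussed in this thread could be
sketched roughly as below. This is only a dry-run planner, not something
taken from OSCAR itself: the function name, the use of "rpm -Uvh" (upgrade)
rather than "-ivh", the /etc/init.d restart convention, and the assumption
that the RPM paths are also visible on the compute nodes (e.g. over a shared
filesystem) are all illustrative assumptions. It prints the commands instead
of running them.

```python
import shlex

# Paths quoted in the thread; everything else here is a hypothetical helper.
DEFAULT_IMAGE_DIR = "/var/lib/systemimager/images/oscarimage"
DEFAULT_REPO_DIR = "/tftpboot/rpm"


def _fmt(argv):
    """Render an argument list as a shell-safe command string."""
    return " ".join(shlex.quote(arg) for arg in argv)


def plan_rpm_update(rpms, daemons=(), image_dir=DEFAULT_IMAGE_DIR,
                    repo_dir=DEFAULT_REPO_DIR):
    """Return the commands (as strings) for the update steps, in order.

    Nothing is executed; the caller decides whether to run them.
    """
    rpms = list(rpms)
    plan = [
        # 1. Upgrade the packages on the headnode.
        _fmt(["rpm", "-Uvh", *rpms]),
        # 2. Upgrade the same packages inside the node image.
        _fmt(["rpm", "-Uvh", "--root", image_dir, *rpms]),
        # 3. Upgrade on the running compute nodes through C3's cexec
        #    (assumes the RPM paths resolve on the nodes as well).
        _fmt(["cexec", "rpm", "-Uvh", *rpms]),
        # 4. Refresh the package pool used when building new images.
        _fmt(["cp", *rpms, repo_dir]),
    ]
    # 5. Restart daemons whose packages do not restart them on upgrade.
    #    Init script names are supplied by the caller (assumption).
    for name in daemons:
        plan.append(_fmt(["/etc/init.d/" + name, "restart"]))
    return plan


if __name__ == "__main__":
    # Hypothetical file and daemon names, for illustration only.
    for command in plan_rpm_update(["torque-new.rpm", "maui-new.rpm"],
                                   daemons=["maui"]):
        print(command)
```

Printing the plan first makes it easy to review each step by hand before
touching the headnode, the image, or the nodes.
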
I am sorry to tell you that this has to be a production cluster by the
beginning of next week, because we are in a hurry now and have been
delayed a lot by the network card problem. I don't know if I will be
able to find the time and place to check b6 on 2 of the nodes not yet
added. Can we know what the fixes in this beta are?

Today I have to recheck our cluster again; I will report on that.

Thanks again,
Carlos

On 3/28/06, Bernard Li <[EMAIL PROTECTED]> wrote:
> Presuming this is not a production cluster...
>
> You should be able to upgrade the RPMs on top of your existing RPMs,
> however you need to do a few things:
>
> - update RPMs on headnode
> - update RPMs on image (rpm -ivhr
>   /var/lib/systemimager/images/oscarimage <rpms>)
> - re-push image or use cexec to update RPMs on compute nodes
> - copy updated RPMs to /tftpboot/rpm
> - re-start the daemons if that is not automatically done
>
> See if that works for you.
>
> P.S. OSCAR 4.2.1b6 should be cut today, would appreciate it if you give
> that a whirl as well.
>
> Thanks,
>
> Bernard
>
> -----Original Message-----
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED]] On Behalf Of
> > carlos vasco
> > Sent: Tuesday, March 28, 2006 2:09
> > To: Bernard Li
> > Cc: [email protected]
> > Subject: [Oscar-devel] Re: Oscar 4.2.1b5 testing
> >
> > Thanks, Bernard, but what would be the way to update these RPMs
> > (reinstall OSCAR, or can it be done over that installation)?
> >
> > Carlos
> >
> > On 3/28/06, Bernard Li <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi Carlos:
> > >
> > > You can use the newer TORQUE RPMs from trunk:
> > >
> > > http://svn.oscar.openclustergroup.org/oscar/trunk/packages/torque/
> > >
> > > Please let us know if this fixes your problem.
> > >
> > > Cheers,
> > >
> > > Bernard
> > >
> > > ________________________________
> > >
> > > From: carlos vasco [mailto:[EMAIL PROTECTED]]
> > > Sent: Tue 28/03/2006 01:55
> > > To: Bernard Li
> > > Cc: [email protected]
> > > Subject: Re: Oscar 4.2.1b5 testing
> > >
> > > Hi Bernard,
> > >
> > > I have been searching the torque list about the Time Use issue,
> > > and found this:
> > >
> > > >> We are using torque-1.2.0p5 and Maui-3.2.6p13. When I do a qstat I
> > > >> see that the 'Time Use' is only a couple of seconds, yet the jobs
> > > >> have been running for a couple of hours. We are running Matlab jobs
> > > >> which are launched from a script. They are only single cpu (no mpi).
> > > >
> > > > This is a bug fixed in 1.2.0p6...
> > >
> > > Since my problem is very similar (not an MPI issue), and OSCAR 4.2.1
> > > ships torque-1.2.0p5 (I think), the solution could be to use
> > > torque-1.2.0p6. Is there an easy way to update torque?
> > >
> > > Thanks,
> > > Carlos
> > >
> > > On 3/28/06, carlos vasco <[EMAIL PROTECTED]> wrote:
> > > > Hi Bernard (and oscar-devel, I forgot last time to cc them):
> > > >
> > > > The test now worked OK, apart from ganglia, but this has been
> > > > reconfigured by our IT people, so it should be ok.
> > > >
> > > > TORQUE still reports 00:00:00 ...
> > > >
> > > > Carlos
> > > >
> > > > On 3/28/06, carlos vasco <[EMAIL PROTECTED]> wrote:
> > > > > Hi Bernard (and oscar-devel, I forgot last time to cc them):
> > > > >
> > > > > The test now worked OK, apart from ganglia, but this has been
> > > > > reconfigured by our IT people, so it should be ok.
> > > > >
> > > > > TORQUE still reports 00:00:00 ...
> > > > >
> > > > > Carlos
> > > > >
> > > > > On 3/28/06, carlos vasco <[EMAIL PROTECTED]> wrote:
> > > > > > Not sure what I did, but apparently while installing OSCAR I
> > > > > > modified ssh_config on the server instead of sshd_config, and
> > > > > > that is why I think I forgot to modify sshd_config.
> > > > > >
> > > > > > I am trying the test again.
> > > > > >
> > > > > > Carlos
> > > > > >
> > > > > > On 3/28/06, Bernard Li <[EMAIL PROTECTED]> wrote:
> > > > > > >
> > > > > > > [ CC:ing oscar-devel on this ]
> > > > > > >
> > > > > > > You shouldn't need to manually modify your
> > > > > > > /etc/ssh/sshd_config to add the "PermitRootLogin" - this
> > > > > > > should already be done for you.
> > > > > > >
> > > > > > > In your error log, it indicates that you have put the option
> > > > > > > in /etc/ssh/ssh_config, which is _wrong_. Try taking out that
> > > > > > > line and re-run the tests (that option should be in
> > > > > > > sshd_config, not ssh_config, but as I mentioned you shouldn't
> > > > > > > need to manually modify it).
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Bernard
> > > > > > >
> > > > > > > ________________________________
> > > > > > >
> > > > > > > From: carlos vasco [mailto:[EMAIL PROTECTED]]
> > > > > > > Sent: Mon 27/03/2006 22:18
> > > > > > > To: Bernard Li
> > > > > > > Subject: Re: [Oscar-devel] Oscar 4.2.1b5 testing
> > > > > > >
> > > > > > > Hi, Bernard,
> > > > > > >
> > > > > > > I don't know exactly which logs you mean; I can only find the
> > > > > > > following in the /home/oscartst/ directory:
> > > > > > >
> > > > > > > drwxr-xr-x 2 oscartst oscartst 4096 Mar 27 15:30 ganglia
> > > > > > > drwxr-xr-x 2 oscartst oscartst 4096 Mar 27 15:31 lam
> > > > > > > drwxr-xr-x 2 oscartst oscartst 4096 Mar 22 16:49 maui
> > > > > > > drwxr-xr-x 2 oscartst oscartst 4096 Mar 27 15:30 mpich
> > > > > > > -rwxr-xr-x 1 oscartst oscartst 4826 Mar 27 15:30 pbs_test
> > > > > > > drwxr-xr-x 2 oscartst oscartst 4096 Mar 27 15:30 pvm
> > > > > > > -rwxr-xr-x 1 oscartst oscartst  927 Mar 27 15:30 ssh_user_tests
> > > > > > > -rwxr-xr-x 1 oscartst oscartst 7326 Mar 27 15:30 test_cluster
> > > > > > > -rwxr-xr-x 1 oscartst oscartst 3562 Mar 27 15:30 testprint
> > > > > > > drwxr-xr-x 2 oscartst oscartst 4096 Mar 27 15:30 torque
> > > > > > >
> > > > > > > In mpich,
> > > > > > > -rw-r--r-- 1 oscartst oscartst   3093 Mar 18 00:30 cpi.c
> > > > > > > -rw-r--r-- 1 oscartst oscartst   1732 Mar 18 00:30 cxxhello.cc
> > > > > > > -rw-r--r-- 1 oscartst oscartst   1647 Mar 18 00:30 f77hello.f
> > > > > > > -rwxrwxr-x 1 oscartst oscartst 337512 Mar 27 15:30 mpich-cpi
> > > > > > > -rw------- 1 oscartst oscartst    136 Mar 27 15:30 mpichtest.err
> > > > > > > -rw------- 1 oscartst oscartst    454 Mar 27 15:30 mpichtest.out
> > > > > > > -rwxr-xr-x 1 oscartst oscartst   1412 Mar 18 00:30 pbs_script.mpich
> > > > > > > -rw-rw-r-- 1 oscartst oscartst    510 Mar 27 13:51 PI21051
> > > > > > > -rw-rw-r-- 1 oscartst oscartst    510 Mar 27 12:37 PI3408
> > > > > > > -rwxr-xr-x 1 oscartst oscartst   2837 Mar 18 00:30 test_user
> > > > > > >
> > > > > > > I attach the mpichtest files.
> > > > > > >
> > > > > > > Not sure how to track down the TORQUE problem; maybe I can
> > > > > > > configure it in the same way we configured the other clusters.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Carlos
> > > > > > >
> > > > > > > On 3/27/06, Bernard Li <[EMAIL PROTECTED]> wrote:
> > > > > > > > Hi Carlos:
> > > > > > > >
> > > > > > > > > No problems have been found during installation, but some
> > > > > > > > > errors did occur during the test phase (see attachment).
> > > > > > > >
> > > > > > > > Can you post the relevant logs in /home/oscartst?
> > > > > > > >
> > > > > > > > > Another problem found is that qstat reports 00:00:00 in
> > > > > > > > > the Time Use field.
> > > > > > > >
> > > > > > > > I wonder if this is a TORQUE bug or a bug in how we set it
> > > > > > > > up - do you think you can dig deeper into this?
> > > > > > > >
> > > > > > > > > During installation, I think I forgot to put
> > > > > > > > > PermitRootLogin yes in sshd_config, and after the nodes
> > > > > > > > > were created, I cpushed the corrected sshd_config file.
> > > > > > > > > Could this be related to the errors?
> > > > > > > >
> > > > > > > > You shouldn't need to edit sshd_config manually - anyways,
> > > > > > > > we should be able to figure out what's wrong by
> > > > > > > > investigating the log files.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Bernard
> > _______________________________________________
> > Oscar-devel mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/oscar-devel
