Hello,
On Tuesday, 15.01.2008 at 15:47 +0100, Dominik Schips wrote:
> Hello DongInn,
>
> On Tuesday, 15.01.2008 at 08:07 -0500, DongInn Kim wrote:
> > Hi Dominik,
> >
> > Well, the OSCAR OpenMPI requires "tm" support. Please recompile
> > OpenMPI on the current cluster, which has the torque and torque-devel
> > rpms installed.
> > (e.g., rpmbuild --rebuild --define "oscar 1" --define "_packager Dominik
> > Ships <[EMAIL PROTECTED]>" --define "_vendor OSCAR" --define
> > "configure_options --with-tm=/opt/pbs" --target x86_64
> > openmpi-1.2.4.src.rpm )
>
> Ok, I'll recompile OpenMPI to get tm support, following your description.
OK, the errors are gone now that I use a version with tm support. I have
to use the current OpenMPI 1.2.5, so I recompiled its source RPM, which
went through without problems. I then made small changes to config.xml
and built an openmpi-switcher-modulefile-1.2.5-1 package for the module.
Step 8 now gives me only 2 errors (see logs below).
Node 1 is working (job scheduling), but node 2 terminates the jobs it
gets from the head node (see logs below).
Any advice or idea why only one node is taking the jobs?
I should mention that this is not a "from scratch" SLES10SP1 (with
OSCAR 5.0) installation.
So what is this error message trying to tell me?
sles10oscar:/home/oscartst/openmpi # cat openmpitest.err
[oscarnode2:04863] pls:tm: failed to poll for a spawned proc, return status = 17002
[oscarnode2:04863] [0,0,0] ORTE_ERROR_LOG: In errno in file rmgr_urm.c at line 462
[oscarnode2:04863] mpiexec: spawn failed with errno=-11
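Since the failure comes from the pls:tm component, it may be worth confirming that the rebuilt Open MPI really exposes its tm components on every node, not only on the head node. The sketch below assumes that ompi_info prints one "MCA <framework>: <component>" line per component and that the relevant frameworks are ras and pls, as in the 1.2 series; the helper name has_tm_support is made up for this example.

```shell
#!/bin/sh
# Succeeds if the ompi_info output on stdin lists a tm component for the
# ras or pls framework (assumed output format: "MCA ras: tm (...)").
has_tm_support() {
    grep -Eq 'MCA[[:space:]]+(ras|pls):[[:space:]]+tm([[:space:]]|$)'
}

# Possible use on the head node and on each compute node:
#   ompi_info | has_tm_support && echo "tm present" || echo "tm missing"
```

If node 1 reports tm and node 2 does not, the two nodes are probably not running the same Open MPI build.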
Error from the LAM/MPI test:
It looks like tm is also missing in LAM. For me, OpenMPI has priority.
sles10oscar:/home/oscartst/lam # cat lamtest.err
ERROR: LAM/MPI does not appear to have the tm boot SSI module!
This test script will now abort.
Performing root tests...
Maui service check:maui [PASSED]
TORQUE node check [PASSED]
TORQUE service check:pbs_server [PASSED]
/home mounts [PASSED]
Preparing user tests...
Performing user tests...
SSH ping test [PASSED]
SSH server->node [PASSED]
SSH node->server [PASSED]
LAM/MPI (via TORQUE) [FAILED]
PVM (via TORQUE) [PASSED]
MPICH (via TORQUE) [PASSED]
Ganglia setup test [PASSED]
Ganglia node count test [PASSED]
TORQUE default queue definition [PASSED]
TORQUE Shell Test [PASSED]
Open MPI (via TORQUE) [FAILED]
Run APItests...
Running Installation tests for pvm
[PASS] 2008-01-18 14:26:40 pvmd-path-ls.apt
[PASS] 2008-01-18 14:26:41 envvar-pvm_arch.apt
[PASS] 2008-01-18 14:26:41 envvar-pvm_root.apt
[PASS] 2008-01-18 14:26:41 envvar.apb
[PASS] 2008-01-18 14:26:41 pvmd-path-which.apt
[PASS] 2008-01-18 14:26:41 modulecmd-path-ls.apt
[PASS] 2008-01-18 14:26:41 pvm-module-list.apt
[PASS] 2008-01-18 14:26:41 pvm-module-show-pvm_rsh.apt
[PASS] 2008-01-18 14:26:41 pvm-module-show-pvm_arch.apt
[PASS] 2008-01-18 14:26:41 pvm-module-show-pvm_root.apt
[PASS] 2008-01-18 14:26:41 pvm-module-show.apb
[PASS] 2008-01-18 14:26:41 pvm-module.apb
[PASS] 2008-01-18 14:26:41 install_tests.apb
There are 2 failed/skipped tests (see above).
Please check for .err and .out files in /home/oscartst/<package>.
...Hit <ENTER> to close this window...
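To pull only the failing entries out of a test run like the one above, a small filter can help. This is a minimal sketch that assumes the test output has been saved to a file; the file name step8.log and the helper name failed_tests are placeholders and not part of OSCAR.

```shell
#!/bin/sh
# Print the names of the tests marked [FAILED] in the test output on stdin,
# with the status marker stripped.
failed_tests() {
    sed -n 's/[[:space:]]*\[FAILED\][[:space:]]*$//p'
}

# Possible use, with step8.log as a hypothetical saved copy of the output:
#   failed_tests < step8.log
```

Each name it prints corresponds to a directory under /home/oscartst/ that should hold the matching .err and .out files.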
Node 1:
oscarnode1:/var/spool/pbs/mom_logs # tail -f 20080118
01/18/2008 14:25:30;0002; pbs_mom;Svr;Log;Log opened
01/18/2008 14:25:30;0002; pbs_mom;Svr;usecp;sles10oscar:/home /home
01/18/2008 14:25:30;0002; pbs_mom;Svr;restricted;sles10oscar
01/18/2008 14:25:30;0002; pbs_mom;n/a;initialize;independent
01/18/2008 14:25:30;0002; pbs_mom;Svr;pbs_mom;Is up
01/18/2008 14:25:30;0002; pbs_mom;n/a;mom_main;hello sent to server sles10oscar
01/18/2008 14:26:21;0008; pbs_mom;Job;66.sles10oscar;JOIN JOB as node 1
01/18/2008 14:26:23;0008; pbs_mom;Job;67.sles10oscar;JOIN JOB as node 1
01/18/2008 14:26:28;0008; pbs_mom;Job;68.sles10oscar;JOIN JOB as node 1
01/18/2008 14:26:37;0008; pbs_mom;Job;69.sles10oscar;JOIN JOB as node 1
01/18/2008 14:26:37;0008; pbs_mom;Job;69.sles10oscar;start_process: task started, tid 3, sid 4740, cmd hostname
01/18/2008 14:26:37;0008; pbs_mom;Job;69.sles10oscar;start_process: task started, tid 5, sid 4741, cmd date
01/18/2008 14:26:39;0008; pbs_mom;Job;70.sles10oscar;JOIN JOB as node 1
Node 2:
01/18/2008 14:26:21;0008; pbs_mom;Job;66.sles10oscar;kill_task: killing pid 4115 task 1 with sig 9
01/18/2008 14:26:21;0008; pbs_mom;Job;66.sles10oscar;Terminated
01/18/2008 14:26:24;0001; pbs_mom;Job;TMomFinalizeJob3;job 67.sles10oscar started, pid = 4173
01/18/2008 14:26:24;0008; pbs_mom;Job;67.sles10oscar;Job Modified at request of [EMAIL PROTECTED]
01/18/2008 14:26:25;0008; pbs_mom;Job;67.sles10oscar;kill_task: killing pid 4174 task 1 with sig 9
01/18/2008 14:26:25;0008; pbs_mom;Job;67.sles10oscar;Terminated
01/18/2008 14:26:29;0001; pbs_mom;Job;TMomFinalizeJob3;job 68.sles10oscar started, pid = 4316
01/18/2008 14:26:29;0008; pbs_mom;Job;68.sles10oscar;Job Modified at request of [EMAIL PROTECTED]
01/18/2008 14:26:34;0008; pbs_mom;Job;68.sles10oscar;kill_task: killing pid 4317 task 1 with sig 9
01/18/2008 14:26:34;0008; pbs_mom;Job;68.sles10oscar;Terminated
01/18/2008 14:26:37;0001; pbs_mom;Job;TMomFinalizeJob3;job 69.sles10oscar started, pid = 4679
01/18/2008 14:26:37;0008; pbs_mom;Job;69.sles10oscar;Job Modified at request of [EMAIL PROTECTED]
01/18/2008 14:26:37;0008; pbs_mom;Job;69.sles10oscar;start_process: task started, tid 2, sid 4731, cmd hostname
01/18/2008 14:26:37;0008; pbs_mom;Job;69.sles10oscar;start_process: task started, tid 4, sid 4733, cmd date
01/18/2008 14:26:37;0008; pbs_mom;Job;69.sles10oscar;kill_task: killing pid 4680 task 1 with sig 9
01/18/2008 14:26:37;0008; pbs_mom;Job;69.sles10oscar;Terminated
01/18/2008 14:26:40;0001; pbs_mom;Job;TMomFinalizeJob3;job 70.sles10oscar started, pid = 4794
01/18/2008 14:26:40;0008; pbs_mom;Job;70.sles10oscar;Job Modified at request of [EMAIL PROTECTED]
01/18/2008 14:26:40;0001; pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
01/18/2008 14:26:40;0008; pbs_mom;Job;70.sles10oscar;kill_task: killing pid 4795 task 1 with sig 9
01/18/2008 14:26:40;0008; pbs_mom;Job;70.sles10oscar;Terminated
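To compare the two nodes more easily, the kill_task events in a pbs_mom log can be counted per job. This is only a sketch: it assumes the semicolon-separated layout seen in the logs above (date;code;daemon;class;job id;message), and the helper name count_kills is invented for this example.

```shell
#!/bin/sh
# Count kill_task events per job in a pbs_mom log read from stdin.
# Only lines of class "Job" whose message starts with kill_task are counted;
# wrapped continuation lines contain no ';' and are ignored.
count_kills() {
    awk -F';' '$4 == "Job" && $6 ~ /^kill_task/ { n[$5]++ }
               END { for (j in n) print j, n[j] }' | sort
}

# Possible use on each node (log path as shown for node 1 above):
#   count_kills < /var/spool/pbs/mom_logs/20080118
```

On the excerpts above, node 2 would show one kill per job for jobs 66 to 70, while node 1 would show none.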
--
Best regards
Dominik Schips
Tel.: +49 (0)21 61 - 46 43-112
Fax: +49 (0)21 61 - 46 43-100
credativ GmbH, HRB Mönchengladbach 12080
Hohenzollernstr. 133, 41061 Mönchengladbach
Managing directors: Dr. Michael Meskes, Jörg Folz