Hi

I've finally managed to fix the problems I had! Apart from the MPI problem
I previously reported, the same thing occurred when submitting single
(non-MPI) jobs: again, only the client node executed the jobs. This happened
because the file "/var/spool/pbs/server_priv/nodes" didn't include the
server's hostname and because pbs_mom wasn't set to start at boot. After
adding the server's hostname to the file (or adding it with `qmgr`) and
starting pbs_mom, everything seemed to work fine for regular job submission.
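
For anyone wanting to repeat this step, here is a rough sketch of what I mean. The helper name ensure_node_listed is my own invention, and the chkconfig/service lines assume a SysV-style init as on Fedora; adapt them to your setup. As far as I know pbs_server reads the nodes file at startup, so it probably needs a restart after the file is edited by hand.

```shell
#!/bin/sh
# Append a hostname to a PBS nodes file unless it is already listed.
# Usage: ensure_node_listed NODES_FILE HOSTNAME   (hypothetical helper)
ensure_node_listed() {
    nodes_file=$1
    node=$2
    # A host counts as listed when it is the first field of some line.
    if ! awk -v n="$node" '$1 == n { found = 1 } END { exit !found }' "$nodes_file"; then
        echo "$node" >> "$nodes_file"
    fi
}

# Typical use on the server, as root (path taken from the message above):
#   ensure_node_listed /var/spool/pbs/server_priv/nodes "$(hostname)"
# Alternative through qmgr instead of editing the file:
#   qmgr -c "create node $(hostname)"
# Assumption: SysV-style init scripts, as on Fedora:
#   chkconfig pbs_mom on
#   service pbs_mom start
```

Calling the helper twice with the same hostname leaves a single entry, so it should be safe to rerun.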

I hoped that fixing this problem would also fix the MPI one, but it didn't.
I kept looking and managed to get a little further. It looks like several
things were missing when I installed OSCAR. I found out that the MPI libraries
("lam-libs.x86_64") weren't installed on the client node (is this normal? I
thought the OSCAR image had all the needed libraries) and that the
LD_LIBRARY_PATH environment variable wasn't set. I installed the MPI libraries
and set the variable, and MPI worked, BUT only on the client node!
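
A minimal sketch of the library step, in case it helps. The function name prepend_ld_path and the directory /usr/lib64/lam are only illustrative assumptions; use whatever directory rpm reports on your nodes.

```shell
#!/bin/sh
# Prepend a directory to LD_LIBRARY_PATH unless it is already present.
# Usage: prepend_ld_path DIR   (hypothetical helper)
prepend_ld_path() {
    dir=$1
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*)
            ;;  # already in the path, nothing to do
        *)
            LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
            ;;
    esac
    export LD_LIBRARY_PATH
}

# On every node, as root, install the package named in the message:
#   yum install lam-libs.x86_64
# Find out where the shared libraries were installed:
#   rpm -ql lam-libs | grep '\.so'
# Then add that directory, e.g. from a login script:
#   prepend_ld_path /usr/lib64/lam    # hypothetical path, check the rpm output
```

The variable has to be set in the environment the jobs actually run in, so a login script sourced on every node seems the safer place for it.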

So the remaining problem had to be with LAM/MPI. I searched the site and found
out that the $PBS_NODEFILE must include ALL computation nodes. I added the
server hostname to a new file (I cannot change the original one, since it is
dynamically generated), but still no good. In the end the problem was that the
LAM/MPI version that comes with Fedora doesn't support the "ssi boot tm"
option, so I just had to change "ssi boot" to "rsh". In short, one just has to
boot LAM/MPI with the command: "lamboot -ssi boot rsh -v node.file".
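
Roughly, the node file and boot step could look like the sketch below. make_boot_schema, SERVER_HOST and my_mpi_program are names I made up for illustration; only the lamboot line is the one I actually used. Note that the helper lists each host once, so any per-processor repetition in $PBS_NODEFILE is dropped.

```shell
#!/bin/sh
# Build a LAM node file: the hosts from a PBS node list plus one extra host,
# each listed once, in order of first appearance.
# Usage: make_boot_schema PBS_NODE_LIST EXTRA_HOST OUTPUT_FILE   (hypothetical helper)
make_boot_schema() {
    { cat "$1"; echo "$2"; } | awk 'NF && !seen[$1]++ { print $1 }' > "$3"
}

# Inside a PBS job script this might be used as follows.
# SERVER_HOST and my_mpi_program are placeholders, not real names:
#   make_boot_schema "$PBS_NODEFILE" "$SERVER_HOST" node.file
#   lamboot -ssi boot rsh -v node.file
#   mpirun C ./my_mpi_program
#   lamhalt
```

The rsh boot module needs password-less remote shell access between the nodes, so that is worth checking first if lamboot hangs or fails.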

In summary:
/var/spool/pbs/server_priv/nodes - must include all execution nodes
pbs_mom - must run at startup on every execution host (server included)
lam-libs.x86_64 - install on every host
$LD_LIBRARY_PATH - set to include the MPI libraries
$PBS_NODEFILE file - must include the hostnames of all execution hosts
PBS script - use "lamboot -ssi boot rsh -v node.file" instead of the command
presented in the samples


I hope someone fixes these issues in the next OSCAR release.
FG

PS - some of these fixes may be inaccurate, but this is how I managed to get
the OSCAR cluster working. I would appreciate it if someone from the OSCAR
development team could check them.

_______________________________________________
Oscar-users mailing list
Oscar-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/oscar-users
