Olivier,
I am struggling to test the cluster (step 8) and would like to share some of
my findings with you all:
The oscartst user is set up on all nodes (I added a few lines to a
post-install script so that the user is created on every node).
I checked the environment, ssh, etc.; everything looks OK.
The OSCAR Torque tests are failing. I "hacked" the pbs_test script in
/var/lib/oscar/testing/torque to print some messages, and it looks like the
test fails because of a timeout (the job is still running after the timeout
has expired).
When checking with
qstat -f
showq
checkjob
and even tracejob, the job appears to be running just fine, but the
shelltest.out file is not generated in /home/oscartst.
I tried different PBS scripts; the .out and .err files are never generated.
# qstat
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
44.hpcmaster shelltest oscartst 00:00:00 R workq
# tracejob 44
/var/lib/torque/server_logs/20130314: No matching job records located
/var/lib/torque/mom_logs/20130314: No matching job records located
/var/lib/torque/sched_logs/20130314: No such file or directory
Job: 44.hpcmaster
03/14/2013 19:24:37 A queue=workq
03/14/2013 19:24:38 A user=oscartst group=oscartst jobname=shelltest
queue=workq ctime=1363285477
qtime=1363285477 etime=1363285477 start=1363285478
owner=oscartst@hpcmaster
exec_host=epsl22/1+epsl22/0+epsl85/1+epsl85/0+epsl88/1+epsl88/0+epsl89/1+epsl89/0
Resource_List.cput=10000:00:00 Resource_List.ncpus=1
Resource_List.neednodes=4:ppn=2
Resource_List.nodect=4 Resource_List.nodes=4:ppn=2
Resource_List.walltime=10000:00:00
03/14/2013 19:29:59 A user=oscartst group=oscartst jobname=shelltest
queue=workq ctime=1363285477
qtime=1363285477 etime=1363285477 start=1363285478
owner=oscartst@hpcmaster
exec_host=epsl22/1+epsl22/0+epsl85/1+epsl85/0+epsl88/1+epsl88/0+epsl89/1+epsl89/0
Resource_List.cput=10000:00:00 Resource_List.ncpus=1
Resource_List.neednodes=4:ppn=2
Resource_List.nodect=4 Resource_List.nodes=4:ppn=2
Resource_List.walltime=10000:00:00
session=22949 end=1363285799 Exit_status=0
resources_used.cput=00:00:00
resources_used.mem=5484kb resources_used.vmem=74892kb
resources_used.walltime=00:05:21
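To isolate whether output staging is the problem, a minimal job script may help. This is only a sketch: the queue name workq and the /home/oscartst path are taken from the output above, everything else (job name, limits) is an assumption:

```shell
#!/bin/sh
# Minimal PBS job to test output staging (a sketch, not the OSCAR test itself).
#PBS -N outtest
#PBS -q workq
#PBS -l nodes=1:ppn=1,walltime=00:02:00
#PBS -o /home/oscartst/outtest.out
#PBS -e /home/oscartst/outtest.err
echo "stdout from $(hostname)"
echo "stderr from $(hostname)" 1>&2
```

If outtest.out never appears after qsub, it may be worth looking for the spool files under /var/lib/torque/spool on the execution host: if they are created there but never copied back, the problem would be in the output staging (rcp/scp or a $usecp mapping in mom_priv/config) rather than in the scheduler.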
The second issue is with Ganglia. I checked the configuration and sniffed the
specified multicast port with tcpdump.
The communication looks OK, but I only see the master in the list of nodes,
and the OSCAR cluster tests succeed only on the head node.
When connecting to http://localhost/ganglia/ only the master's data is shown.
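For what it's worth, seeing only the head node is the typical symptom of the compute nodes' gmond sending on a different channel than the head node's gmond is listening on. A quick sanity check is to compare the udp_send_channel / udp_recv_channel stanzas across nodes; the sketch below uses Ganglia's default multicast address and port as an assumption — the real values live in /etc/ganglia/gmond.conf and may differ:

```shell
# Sketch: extract the multicast channel from a gmond.conf so it can be
# compared across nodes. The sample stanza uses Ganglia's defaults
# (239.2.11.71:8649); substitute the real /etc/ganglia/gmond.conf.
cat > /tmp/gmond.conf.sample <<'EOF'
udp_send_channel {
  mcast_join = 239.2.11.71
  port = 8649
}
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
}
EOF
# Every node must agree on these values.
grep -E 'mcast_join|port' /tmp/gmond.conf.sample | sort -u
```

It may also be worth checking that gmond is actually running on the compute nodes and that no firewall between the nodes is dropping the multicast traffic.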
The third issue is with the blcr module. The test shows that blcr is not
loaded, and trying to load it manually with insmod fails.
modinfo reports:
vermagic: 2.6.32-279.2.1.el6.x86_64 SMP
My kernel version is 2.6.32-279.22.1.el6.x86_64, which is slightly different.
This version mismatch must be the reason for the insmod failure.
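That would indeed explain it: the kernel refuses a module whose vermagic does not match the running kernel, so blcr would need to be rebuilt against 2.6.32-279.22.1 (e.g. by installing a matching blcr kernel-module package or rebuilding from source against the running kernel's headers). A tiny sketch of the check, with the two version strings from above hard-coded for illustration:

```shell
# Compare the module's vermagic kernel with the running kernel.
# On a real node, mod_ver would come from: modinfo -F vermagic blcr
# and kern_ver from: uname -r
mod_ver="2.6.32-279.2.1.el6.x86_64"
kern_ver="2.6.32-279.22.1.el6.x86_64"
if [ "$mod_ver" = "$kern_ver" ]; then
  echo "vermagic OK"
else
  echo "vermagic mismatch: module built for $mod_ver, kernel is $kern_ver"
fi
```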
Best Regards,
Costel SEITAN
_______________________________________________
Oscar-users mailing list
Oscar-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/oscar-users