Olivier,

I am struggling to test the cluster (step 8) and would like to share some of my
findings with you all:

The oscartst user is set up on all nodes (I added a few lines to a post-install
script so that the user is created on every node).
I checked the environment, SSH, etc.; everything looks OK.


The OSCAR Torque tests are failing. I "hacked" the pbs_test script in
/var/lib/oscar/testing/torque to print some messages,
and it looks like it fails because of the timeout (the job is still running
after the timeout has expired).
When checking with
qstat -f
showq
checkjob
and even tracejob
the job appears to be running just fine, but the shelltest.out file is not
generated in /home/oscartst.
I tried different PBS scripts; the .out and .err files are not generated either.
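For what it's worth, here is the kind of minimal job script I have been using to separate the two possibilities (this is my own hypothetical test, not the OSCAR pbs_test itself): it writes a file directly from the job instead of relying on PBS staging the spool .out/.err back. If the direct file appears but the .o<jobid> file never does, the job itself runs fine and the problem is in pbs_mom copying the spool files back to /home.

```shell
#!/bin/sh
# Hypothetical minimal job (outtest.sh): write a file directly from the
# job script instead of relying on PBS copying the spool .out/.err back.
#PBS -N outtest
#PBS -l nodes=1,walltime=00:01:00
cd "${PBS_O_WORKDIR:-.}"          # submit directory (falls back to . outside PBS)
echo "job ran on $(hostname)" > outtest.direct
cat outtest.direct
```

Submitted with qsub from /home/oscartst: outtest.direct shows up, outtest.o<jobid> does not, which makes me suspect the copy-back (rcp/scp from the mom) rather than the job.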



# qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
44.hpcmaster               shelltest        oscartst        00:00:00 R workq

# tracejob 44
/var/lib/torque/server_logs/20130314: No matching job records located
/var/lib/torque/mom_logs/20130314: No matching job records located
/var/lib/torque/sched_logs/20130314: No such file or directory

Job: 44.hpcmaster

03/14/2013 19:24:37  A    queue=workq
03/14/2013 19:24:38  A    user=oscartst group=oscartst jobname=shelltest queue=workq
                          ctime=1363285477 qtime=1363285477 etime=1363285477
                          start=1363285478 owner=oscartst@hpcmaster
                          exec_host=epsl22/1+epsl22/0+epsl85/1+epsl85/0+epsl88/1+epsl88/0+epsl89/1+epsl89/0
                          Resource_List.cput=10000:00:00 Resource_List.ncpus=1
                          Resource_List.neednodes=4:ppn=2 Resource_List.nodect=4
                          Resource_List.nodes=4:ppn=2 Resource_List.walltime=10000:00:00
03/14/2013 19:29:59  A    user=oscartst group=oscartst jobname=shelltest queue=workq
                          ctime=1363285477 qtime=1363285477 etime=1363285477
                          start=1363285478 owner=oscartst@hpcmaster
                          exec_host=epsl22/1+epsl22/0+epsl85/1+epsl85/0+epsl88/1+epsl88/0+epsl89/1+epsl89/0
                          Resource_List.cput=10000:00:00 Resource_List.ncpus=1
                          Resource_List.neednodes=4:ppn=2 Resource_List.nodect=4
                          Resource_List.nodes=4:ppn=2 Resource_List.walltime=10000:00:00
                          session=22949 end=1363285799 Exit_status=0
                          resources_used.cput=00:00:00 resources_used.mem=5484kb
                          resources_used.vmem=74892kb resources_used.walltime=00:05:21



The second issue is with Ganglia. I checked the configuration and sniffed the
specified multicast port with tcpdump.
The communication looks OK, but I only see the master in the list of nodes, and
the OSCAR cluster tests succeeded only on the head node.
When connecting to http://localhost/ganglia/ only the master's data is shown.
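One thing I still want to rule out is a channel mismatch: every compute node's gmond.conf has to send on the same multicast group/port the head node's gmond listens on. The stock multicast configuration looks like this (the values below are the Ganglia defaults, not necessarily what OSCAR wrote):

```
udp_send_channel {
  mcast_join = 239.2.11.71
  port = 8649
  ttl = 1
}
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
}
```

A node whose gmond is stopped, or that sends on a different group/port, or whose cluster { name = ... } does not match, simply never shows up in the web front end, so I will also verify that gmond is actually running on the compute nodes.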

The third issue is with the blcr module. The test shows that blcr is not loaded,
and loading it manually with insmod fails.
modinfo reports:
vermagic:       2.6.32-279.2.1.el6.x86_64 SMP

My kernel version is 2.6.32-279.22.1.el6.x86_64, which is slightly different.
This must be the reason for the insmod failure.
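The mismatch can be checked mechanically; here is a small sketch, with the two version strings hard-coded from the modinfo and uname output above (on a live system I would capture them instead of hard-coding):

```shell
#!/bin/sh
# Compare the kernel blcr was built for (modinfo vermagic) with the
# running kernel (uname -r). On the cluster these would come from:
#   built_for=$(modinfo -F vermagic blcr | awk '{print $1}')
#   running=$(uname -r)
built_for="2.6.32-279.2.1.el6.x86_64"   # from modinfo output above
running="2.6.32-279.22.1.el6.x86_64"    # from uname -r
if [ "$built_for" = "$running" ]; then
    echo "vermagic matches"
else
    echo "vermagic mismatch: rebuild blcr against $running"
fi
```

insmod refuses a module whose vermagic does not match the running kernel, so rebuilding the blcr kernel module against the installed 2.6.32-279.22.1 kernel (or installing the kernel the module was built for) should be the fix.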


Best Regards,
Costel SEITAN
_______________________________________________
Oscar-users mailing list
Oscar-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/oscar-users
