Costel, thanks a lot for all your effort and sharing,
- regarding the oscartst user, I thought I had fixed it; I'll take a look at it.
- regarding torque, are you sure that maui is running and that its config is
ok? I had the same problem you described, but I can't remember what fixed
it. I'm currently working on systemimager. Once it is fixed, I'll get back to
this kind of testing.
- finally, blcr: I'll rebuild the module today. The blcr spec file from upstream
is really problematic, as the module needs to be rebuilt each time a new kernel
is released. Moreover, you need to reboot with the new kernel to build it. My aim
is to use dkms to build the module at boot if needed. I've seen a spec file for
blcr on the web that does this, but it was not fully satisfactory. I'd like
to include a switch to build either the blcr-modules or the dkms-blcr or both
and submit it upstream, so distro makers can choose to rebuild the static module
each time they provide a new kernel, and external repos can choose the second
option (dkms) and have a functional module at each boot.
For the moment, I'll rebuild the module today.
Best regards.
--
Olivier LAHAYE
CEA DRT/LIST/DCSI/DIR
________________________________
From: Costel Seitan [csei...@slb.com]
Sent: Thursday, 14 March 2013 19:48
To: oscar-users@lists.sourceforge.net
Subject: [Oscar-users] Oscar Test Cluster Setup
Olivier,
I am struggling to test the cluster (step 8) and I would like to share some of
my findings with you all:
The oscartst user is set up on all nodes (I added a few lines to a post-install
script so that it is added to every node).
I checked the environment, ssh, and so on; everything looks ok.
The OSCAR torque tests are failing. I “hacked” the pbs_test script in
/var/lib/oscar/testing/torque to print some messages,
and it looks like it fails because of the timeout (the job is still running
after the timeout has expired).
When checking with
qstat -f
showq
checkjob
and even tracejob
the job appears to be running just fine, but the shelltest.out file is not
generated in /home/oscartst or /home/oscartst.
I tried different PBS scripts; the .out and .err files are not generated either.
# qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
44.hpcmaster              shelltest        oscartst        00:00:00 R workq
# tracejob 44
/var/lib/torque/server_logs/20130314: No matching job records located
/var/lib/torque/mom_logs/20130314: No matching job records located
/var/lib/torque/sched_logs/20130314: No such file or directory
Job: 44.hpcmaster
03/14/2013 19:24:37 A queue=workq
03/14/2013 19:24:38 A user=oscartst group=oscartst jobname=shelltest
queue=workq ctime=1363285477
qtime=1363285477 etime=1363285477 start=1363285478
owner=oscartst@hpcmaster
exec_host=epsl22/1+epsl22/0+epsl85/1+epsl85/0+epsl88/1+epsl88/0+epsl89/1+epsl89/0
Resource_List.cput=10000:00:00 Resource_List.ncpus=1
Resource_List.neednodes=4:ppn=2
Resource_List.nodect=4 Resource_List.nodes=4:ppn=2
Resource_List.walltime=10000:00:00
03/14/2013 19:29:59 A user=oscartst group=oscartst jobname=shelltest
queue=workq ctime=1363285477
qtime=1363285477 etime=1363285477 start=1363285478
owner=oscartst@hpcmaster
exec_host=epsl22/1+epsl22/0+epsl85/1+epsl85/0+epsl88/1+epsl88/0+epsl89/1+epsl89/0
Resource_List.cput=10000:00:00 Resource_List.ncpus=1
Resource_List.neednodes=4:ppn=2
Resource_List.nodect=4 Resource_List.nodes=4:ppn=2
Resource_List.walltime=10000:00:00
session=22949 end=1363285799 Exit_status=0
resources_used.cput=00:00:00
resources_used.mem=5484kb resources_used.vmem=74892kb
resources_used.walltime=00:05:21
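As a cross-check on the timeout theory, the epoch timestamps in the accounting record above can be turned into durations. This is only a minimal sketch: the helper name job_timings is made up for illustration, and the record string is a trimmed copy of the fields printed by tracejob. The test script's own timeout value is not shown here, so it has to be compared by hand.

```python
import re

def job_timings(record: str) -> dict:
    """Extract qtime/start/end (epoch seconds) from an accounting record
    and return how long the job waited in the queue and how long it ran."""
    fields = {k: int(v) for k, v in re.findall(r"\b(qtime|start|end)=(\d+)", record)}
    return {
        "queued_s": fields["start"] - fields["qtime"],
        "ran_s": fields["end"] - fields["start"],
    }

# Fields copied from the second accounting record of job 44 above.
record = ("ctime=1363285477 qtime=1363285477 etime=1363285477 "
          "start=1363285478 session=22949 end=1363285799 Exit_status=0")

timings = job_timings(record)
print(timings)  # {'queued_s': 1, 'ran_s': 321}
```

The 321 seconds agree with resources_used.walltime=00:05:21, so the job started one second after being queued and then stayed in the running state for a bit over five minutes.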
The second issue is with Ganglia. I checked the configuration and sniffed the
specified multicast port with tcpdump.
The communication looks ok, but I only see the master in the list of nodes, and
the OSCAR cluster tests succeeded
only on the head node.
When connecting to http://localhost/ganglia/, only the master's data is shown.
The third issue is with the blcr module. The test shows that blcr is not loaded.
I tried to load it manually with insmod and it fails.
When getting the info with modinfo, we get
vermagic: 2.6.32-279.2.1.el6.x86_64 SMP
My kernel version is 2.6.32-279.22.1.el6.x86_64, which is slightly different.
This must be the reason for the insmod failure.
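The mismatch can be spelled out with a small comparison. This is a simplified sketch, not how the kernel itself validates modules: it only compares the release field (the first word of the vermagic string) with the running kernel release, and the function names are made up for illustration.

```python
def vermagic_release(vermagic: str) -> str:
    """Return the kernel release, i.e. the first whitespace-separated
    field of a module's vermagic string."""
    return vermagic.split()[0]

def module_matches_kernel(vermagic: str, running_release: str) -> bool:
    """Simplified check: does the release the module was built for
    equal the release of the running kernel?"""
    return vermagic_release(vermagic) == running_release

# Values taken from the modinfo output and kernel version quoted above.
module_vermagic = "2.6.32-279.2.1.el6.x86_64 SMP"
running_kernel = "2.6.32-279.22.1.el6.x86_64"

matches = module_matches_kernel(module_vermagic, running_kernel)
print(matches)  # False: 279.2.1 vs 279.22.1
```

On a live node the running release could be read with platform.release() from the Python standard library instead of being hard-coded.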
Best Regards,
Costel SEITAN