Costel, thanks a lot for all your effort and sharing,

- Regarding the oscartst user, I thought I had fixed it; I'll have a look at it.
- Regarding torque, are you sure that maui is running and that its config is
ok? I had the same problem you described, but I can't remember what fixed it.
I'm currently working on systemimager. Once that is fixed, I'll get back to
this kind of testing.
- Finally, blcr: I'll rebuild the module today. The blcr spec file from
upstream is really problematic, as the module needs to be rebuilt each time a
new kernel is released. Moreover, you need to reboot into the new kernel to
build it. My aim is to use dkms to build the module at boot if needed. I've
seen a spec file for blcr on the web that does this, but it was not fully
satisfactory. I'd like to include a switch to build either blcr-modules,
dkms-blcr, or both, and submit it upstream, so that distro makers can choose
to rebuild the static module each time they provide a new kernel, and external
repos can choose the second option (dkms) and have a functional module at each
boot.
For the moment, I'll rebuild the module today.

Best regards.

--
   Olivier LAHAYE
   CEA DRT/LIST/DCSI/DIR
________________________________
From: Costel Seitan [csei...@slb.com]
Sent: Thursday, March 14, 2013 19:48
To: oscar-users@lists.sourceforge.net
Subject: [Oscar-users] Oscar Test Cluster Setup

Olivier,

I am struggling to test the cluster (step 8) and I would like to share some of
my findings with you all:

The oscartst user is set up on all nodes (I added a few lines to a post-install
script so that it is added to every node).
I checked the environment, ssh, etc.; everything looks ok.


The OSCAR torque tests are failing. I "hacked" the pbs_test script in
/var/lib/oscar/testing/torque to print some messages,
and it looks like it fails because of the timeout (the job is still running
after the timeout has expired).
When checking with
qstat -f
showq
checkjob
and even tracejob
the job looks like it is running just fine, but the shelltest.out file is not
generated in /home/oscartst or /home/oscartst.
I tried different pbs scripts; the .out and .err files are not generated.



# qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
44.hpcmaster               shelltest        oscartst        00:00:00 R workq

# tracejob 44
/var/lib/torque/server_logs/20130314: No matching job records located
/var/lib/torque/mom_logs/20130314: No matching job records located
/var/lib/torque/sched_logs/20130314: No such file or directory

Job: 44.hpcmaster

03/14/2013 19:24:37  A    queue=workq
03/14/2013 19:24:38  A    user=oscartst group=oscartst jobname=shelltest queue=workq ctime=1363285477
                          qtime=1363285477 etime=1363285477 start=1363285478 owner=oscartst@hpcmaster
                          exec_host=epsl22/1+epsl22/0+epsl85/1+epsl85/0+epsl88/1+epsl88/0+epsl89/1+epsl89/0
                          Resource_List.cput=10000:00:00 Resource_List.ncpus=1 Resource_List.neednodes=4:ppn=2
                          Resource_List.nodect=4 Resource_List.nodes=4:ppn=2 Resource_List.walltime=10000:00:00
03/14/2013 19:29:59  A    user=oscartst group=oscartst jobname=shelltest queue=workq ctime=1363285477
                          qtime=1363285477 etime=1363285477 start=1363285478 owner=oscartst@hpcmaster
                          exec_host=epsl22/1+epsl22/0+epsl85/1+epsl85/0+epsl88/1+epsl88/0+epsl89/1+epsl89/0
                          Resource_List.cput=10000:00:00 Resource_List.ncpus=1 Resource_List.neednodes=4:ppn=2
                          Resource_List.nodect=4 Resource_List.nodes=4:ppn=2 Resource_List.walltime=10000:00:00
                          session=22949 end=1363285799 Exit_status=0 resources_used.cput=00:00:00
                          resources_used.mem=5484kb resources_used.vmem=74892kb resources_used.walltime=00:05:21
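
As a sanity check on the accounting record above, the start and end epoch
values can be subtracted to confirm the reported walltime. This is only a
minimal Python sketch: the two numbers are copied from the tracejob output,
while the helper name format_hms is made up for illustration.

```python
# Values copied from the tracejob accounting record for job 44.
start = 1363285478
end = 1363285799

elapsed = end - start  # seconds between job start and job end


def format_hms(seconds):
    """Format a duration in seconds as HH:MM:SS (hypothetical helper)."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


walltime = format_hms(elapsed)
print(elapsed, walltime)  # prints: 321 00:05:21
```

The result agrees with resources_used.walltime=00:05:21, so the job ran for a
little over five minutes before ending with Exit_status=0. That would be
consistent with the test's timeout expiring before the job finished, as
described earlier in this message.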



The second issue is with Ganglia. I checked the configuration and sniffed the
specified multicast port with tcpdump.
The communication looks ok, but I only see the master in the list of nodes, and
the OSCAR cluster tests succeeded
only on the head node.
When connecting to http://localhost/ganglia/, only the master's data is shown.
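
One way to see which hosts gmond itself knows about, independently of the web
front end, is to read its XML dump and list the HOST entries. The following is
a rough Python sketch, not part of OSCAR: it assumes gmond is listening on its
usual TCP accept port 8649 on the head node (check tcp_accept_channel in
gmond.conf), and the function names and the trimmed sample document are
hypothetical.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host="localhost", port=8649, timeout=5.0):
    """Read the XML dump gmond sends on connect (port 8649 is an assumption)."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def host_names(gmond_xml):
    """Return the sorted NAME attributes of all HOST elements in the dump."""
    root = ET.fromstring(gmond_xml)
    return sorted(host.get("NAME") for host in root.iter("HOST"))


# Hypothetical, heavily trimmed sample; a real dump carries many more
# attributes and one METRIC element per reported metric.
sample = """<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
  <CLUSTER NAME="OSCAR" LOCALTIME="0">
    <HOST NAME="hpcmaster" IP="10.0.0.1"/>
  </CLUSTER>
</GANGLIA_XML>"""

print(host_names(sample))  # prints: ['hpcmaster']
```

On the head node, host_names(fetch_gmond_xml()) would then show whether the
compute nodes ever reach gmond. If only the master is listed there as well,
the problem is more likely in the gmond channel configuration on the nodes
than in the web front end.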

The third issue is with the blcr module. The test shows that blcr is not
loaded. I tried to load it manually with insmod, and it fails.
The output of modinfo shows
vermagic:       2.6.32-279.2.1.el6.x86_64 SMP

My kernel version is 2.6.32-279.22.1.el6.x86_64, which is slightly different.
This must be the reason for the insmod failure.
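
The mismatch can be checked mechanically by comparing the kernel release at the
start of the vermagic string with the running kernel release. This is a
simplified Python sketch with made-up function names; the kernel itself
compares more than the release (the remaining vermagic flags as well), so a
match here is necessary but not sufficient.

```python
def vermagic_release(vermagic):
    """Return the kernel release: the first whitespace-separated field."""
    return vermagic.split()[0]


def module_matches_kernel(vermagic, running_release):
    """True if the module was built for exactly the running kernel release."""
    return vermagic_release(vermagic) == running_release


# Both strings are the ones quoted above in this message.
module_vermagic = "2.6.32-279.2.1.el6.x86_64 SMP"
running_kernel = "2.6.32-279.22.1.el6.x86_64"

print(module_matches_kernel(module_vermagic, running_kernel))  # prints: False
```

On a live system the running release could be taken from platform.release()
(the same value as uname -r) instead of a hard-coded string.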


Best Regards,
Costel SEITAN
_______________________________________________
Oscar-users mailing list
Oscar-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/oscar-users
