Dear Hector, thank you for your prompt reply.
The problem is that, while several jobs were active on the cluster,
a user launched a calculation as one job and, about 5 hours later,
launched another. Both jobs ended up on the same
node and the same cores.
Part of the qstat output:
[root@fe ~]# qstat -f 477
Job Id: 477.fe
Job_Name = job_gr_PBE
Job_Owner = matias@fe
job_state = Q
queue = batch
server = fe
Checkpoint = u
ctime = Tue Nov 8 11:58:16 2011
Error_Path = fe:/usr/home/matias/graf/graf-graf-PBE-VdW/job_gr_PBE.e477
exec_host = n10/3+n10/2+n10/1+n10/0
exec_port = 15003+15003+15003+15003
[root@fe ~]# qstat -f 480
Job Id: 480.fe
Job_Name = job_gr_PBE
Job_Owner = matias@fe
job_state = Q
queue = batch
server = fe
Checkpoint = u
ctime = Tue Nov 8 17:26:09 2011
Error_Path = fe:/usr/home/matias/graf/graf-graf-PBE-VdW/job_gr_PBE.e480
exec_host = n10/3+n10/2+n10/1+n10/0
exec_port = 15003+15003+15003+15003
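The overlap between the two exec_host values above can also be checked mechanically. This is only a minimal sketch in Python, assuming the exec_host format shown in the qstat output (node/core entries joined by "+"); the function names are hypothetical and are not part of Torque or Maui.

```python
def parse_exec_host(value):
    """Split an exec_host string such as 'n10/3+n10/2' into a set of (node, core) pairs."""
    slots = set()
    for entry in value.split("+"):
        node, core = entry.split("/")
        slots.add((node, int(core)))
    return slots


def shared_slots(a, b):
    """Return the sorted (node, core) pairs that appear in both exec_host strings."""
    return sorted(parse_exec_host(a) & parse_exec_host(b))


# exec_host values copied from the qstat -f output for jobs 477 and 480
job_477 = "n10/3+n10/2+n10/1+n10/0"
job_480 = "n10/3+n10/2+n10/1+n10/0"

print(shared_slots(job_477, job_480))
# → [('n10', 0), ('n10', 1), ('n10', 2), ('n10', 3)]
```

An empty list would mean the two jobs share no node/core slot.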
This is what tracejob reports for both jobs:
[root@fe ~]# tracejob 480
/var/spool/torque/mom_logs/20111108: No such file or directory
/var/spool/torque/sched_logs/20111108: No such file or directory
Job: 480.fe
11/08/2011 17:26:09 S enqueuing into batch, state 1 hop 1
11/08/2011 17:26:09 S Job Queued at request of matias@fe, owner =
matias@fe,
job name = job_gr_PBE, queue = batch
11/08/2011 17:26:09 A queue=batch
11/08/2011 17:26:10 S Job Run at request of root@fe
11/08/2011 17:26:12 S unable to run job, MOM rejected/rc=2
11/08/2011 18:26:36 S Job Run at request of root@fe
11/08/2011 18:26:38 S unable to run job, MOM rejected/rc=2
11/08/2011 19:26:45 S Job Run at request of root@fe
11/08/2011 19:26:45 S Not sending email: User does not want mail of this
type.
11/08/2011 19:26:45 A user=matias group=matias jobname=job_gr_PBE
queue=batch ctime=1320783969 qtime=1320783969
etime=1320783969 start=1320791205 owner=matias@fe
exec_host=n11/7+n11/6+n11/5+n11/4
Resource_List.neednodes=1:ppn=4
Resource_List.nodect=1
Resource_List.nodes=1:ppn=4
Resource_List.walltime=2400:00:00
11/08/2011 19:26:53 S Not sending email: User does not want mail of this
type.
11/08/2011 19:26:53 S Exit_status=0 resources_used.cput=00:00:27
resources_used.mem=0kb resources_used.vmem=0kb
resources_used.walltime=00:00:09
11/08/2011 19:26:53 A user=matias group=matias jobname=job_gr_PBE
queue=batch ctime=1320783969 qtime=1320783969
etime=1320783969 start=1320791205 owner=matias@fe
exec_host=n11/7+n11/6+n11/5+n11/4
Resource_List.neednodes=1:ppn=4
Resource_List.nodect=1
Resource_List.nodes=1:ppn=4
Resource_List.walltime=2400:00:00 session=8035
end=1320791213 Exit_status=0
resources_used.cput=00:00:27
resources_used.mem=0kb
resources_used.vmem=0kb
resources_used.walltime=00:00:09
11/08/2011 19:31:53 S dequeuing from batch, state COMPLETE
[root@fe ~]#
[root@fe ~]# tracejob 477
/var/spool/torque/mom_logs/20111108: No such file or directory
/var/spool/torque/sched_logs/20111108: No such file or directory
Job: 477.fe
11/08/2011 11:58:16 S enqueuing into batch, state 1 hop 1
11/08/2011 11:58:16 S Job Queued at request of matias@fe, owner =
matias@fe,
job name = job_gr_PBE, queue = batch
11/08/2011 11:58:16 A queue=batch
11/08/2011 11:58:17 S Job Run at request of root@fe
11/08/2011 11:58:19 S unable to run job, MOM rejected/rc=2
11/08/2011 12:58:34 S Job Run at request of root@fe
11/08/2011 12:58:36 S unable to run job, MOM rejected/rc=2
11/08/2011 13:58:37 S Job Run at request of root@fe
11/08/2011 13:58:39 S unable to run job, MOM rejected/rc=2
11/08/2011 14:58:43 S Job Run at request of root@fe
11/08/2011 14:58:45 S unable to run job, MOM rejected/rc=2
11/08/2011 15:59:09 S Job Run at request of root@fe
11/08/2011 15:59:11 S unable to run job, MOM rejected/rc=2
11/08/2011 16:59:30 S Job Run at request of root@fe
11/08/2011 16:59:32 S unable to run job, MOM rejected/rc=2
11/08/2011 17:59:50 S Job Run at request of root@fe
11/08/2011 17:59:52 S unable to run job, MOM rejected/rc=2
11/08/2011 19:00:02 S Job Run at request of root@fe
11/08/2011 19:00:02 S Not sending email: User does not want mail of this
type.
11/08/2011 19:00:02 A user=matias group=matias jobname=job_gr_PBE
queue=batch ctime=1320764296 qtime=1320764296
etime=1320764296 start=1320789602 owner=matias@fe
exec_host=n11/7+n11/6+n11/5+n11/4
Resource_List.neednodes=1:ppn=4
Resource_List.nodect=1
Resource_List.nodes=1:ppn=4
Resource_List.walltime=2400:00:00
11/08/2011 19:00:10 S Not sending email: User does not want mail of this
type.
11/08/2011 19:00:10 S Exit_status=0 resources_used.cput=00:00:27
resources_used.mem=0kb resources_used.vmem=0kb
resources_used.walltime=00:00:09
11/08/2011 19:00:10 A user=matias group=matias jobname=job_gr_PBE
queue=batch ctime=1320764296 qtime=1320764296
etime=1320764296 start=1320789602 owner=matias@fe
exec_host=n11/7+n11/6+n11/5+n11/4
Resource_List.neednodes=1:ppn=4
Resource_List.nodect=1
Resource_List.nodes=1:ppn=4
Resource_List.walltime=2400:00:00 session=7936
end=1320789610 Exit_status=0
resources_used.cput=00:00:27
resources_used.mem=0kb
resources_used.vmem=0kb
resources_used.walltime=00:00:09
11/08/2011 19:05:11 S dequeuing from batch, state COMPLETE
[root@fe ~]#
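To read the start= and end= fields in the accounting ("A") records above, here is a small sketch in Python. It assumes those fields are Unix epoch seconds and prints them in UTC; the helper names are hypothetical.

```python
from datetime import datetime, timezone

# start/end values copied from the accounting records of jobs 477 and 480
runs = {
    "477.fe": (1320789602, 1320789610),
    "480.fe": (1320791205, 1320791213),
}


def to_utc(ts):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


def windows_overlap(a, b):
    """True if two (start, end) intervals share any instant."""
    return a[0] <= b[1] and b[0] <= a[1]


for job, (start, end) in runs.items():
    print(job, to_utc(start).isoformat(), "->", to_utc(end).isoformat(),
          f"({end - start} s)")

print("overlap:", windows_overlap(runs["477.fe"], runs["480.fe"]))
# → overlap: False
```

This only compares the run windows recorded in these two tracejob outputs; it says nothing about what qstat reported while the jobs were still queued.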
I don't understand why the logs are missing from the directories
/var/spool/torque/mom_logs and /var/spool/torque/sched_logs.
Regards,
Fernando
----------------------------------------------------
Ing. Fernando Caba
Director General de Telecomunicaciones
Universidad Nacional del Sur
http://www.dgt.uns.edu.ar
Tel/Fax: (54)-291-4595166
Tel: (54)-291-4595101 int. 2050
Avda. Alem 1253, (B8000CPB) Bahía Blanca - Argentina
----------------------------------------------------
On 08/11/2011 07:17 PM, Hector Oliver wrote:
What is the state of the jobs (tracejob #job)?
Do both of them show up in qstat?
Does your configuration allow several jobs to run at the same time?
On Tue, Nov 8, 2011 at 3:58 PM, Fernando Caba <[email protected]> wrote:
Hi mauiusers, I have a job assigned to node10, cores 0 to 3, and
another job assigned to the same node and the very same cores
(0 to 3).
Does anybody have any idea what is happening? I have torque-3.0.1 and
maui-3.3.1.
Thanks
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers