Dear Hector, thank you for your prompt reply.
The problem occurred while the cluster was busy with several jobs: a user first submitted a calculation as one job and, about 5 hours later, submitted another. Both jobs ended up on the same node and the same cores.

Part of the qstat output:

[root@fe ~]# qstat -f 477
Job Id: 477.fe
    Job_Name = job_gr_PBE
    Job_Owner = matias@fe
    job_state = Q
    queue = batch
    server = fe
    Checkpoint = u
    ctime = Tue Nov  8 11:58:16 2011
    Error_Path = fe:/usr/home/matias/graf/graf-graf-PBE-VdW/job_gr_PBE.e477
    exec_host = n10/3+n10/2+n10/1+n10/0
    exec_port = 15003+15003+15003+15003

[root@fe ~]# qstat -f 480
Job Id: 480.fe
    Job_Name = job_gr_PBE
    Job_Owner = matias@fe
    job_state = Q
    queue = batch
    server = fe
    Checkpoint = u
    ctime = Tue Nov  8 17:26:09 2011
    Error_Path = fe:/usr/home/matias/graf/graf-graf-PBE-VdW/job_gr_PBE.e480
    exec_host = n10/3+n10/2+n10/1+n10/0
    exec_port = 15003+15003+15003+15003
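
The overlap between the two exec_host values above can be checked mechanically. What follows is only a minimal sketch in Python, not part of TORQUE or Maui: the helper names parse_exec_host and find_shared_slots are hypothetical, and the sketch assumes every exec_host entry has the node/core form shown in the qstat output.

```python
from collections import defaultdict


def parse_exec_host(exec_host):
    """Turn a string such as 'n10/3+n10/2' into a set of (node, core) pairs."""
    slots = set()
    for item in exec_host.split("+"):
        node, core = item.split("/")
        slots.add((node, int(core)))
    return slots


def find_shared_slots(jobs):
    """Return the (node, core) slots claimed by more than one job.

    `jobs` maps a job id to its exec_host string; the result maps each
    shared slot to the sorted list of job ids that claim it.
    """
    owners = defaultdict(list)
    for job_id, exec_host in jobs.items():
        for slot in parse_exec_host(exec_host):
            owners[slot].append(job_id)
    return {slot: sorted(ids) for slot, ids in owners.items() if len(ids) > 1}


# exec_host values copied from the qstat -f output for jobs 477 and 480.
jobs = {
    "477.fe": "n10/3+n10/2+n10/1+n10/0",
    "480.fe": "n10/3+n10/2+n10/1+n10/0",
}

shared = find_shared_slots(jobs)
for slot in sorted(shared):
    print(slot, shared[slot])
```

With the two values above, every one of the four slots on n10 is reported as claimed by both jobs.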

This is what the tracejob command gives me for both jobs:
[root@fe ~]# tracejob 480
/var/spool/torque/mom_logs/20111108: No such file or directory
/var/spool/torque/sched_logs/20111108: No such file or directory

Job: 480.fe

11/08/2011 17:26:09  S    enqueuing into batch, state 1 hop 1
11/08/2011 17:26:09  S    Job Queued at request of matias@fe, owner = matias@fe,
                          job name = job_gr_PBE, queue = batch
11/08/2011 17:26:09  A    queue=batch
11/08/2011 17:26:10  S    Job Run at request of root@fe
11/08/2011 17:26:12  S    unable to run job, MOM rejected/rc=2
11/08/2011 18:26:36  S    Job Run at request of root@fe
11/08/2011 18:26:38  S    unable to run job, MOM rejected/rc=2
11/08/2011 19:26:45  S    Job Run at request of root@fe
11/08/2011 19:26:45  S    Not sending email: User does not want mail of this
                          type.
11/08/2011 19:26:45  A    user=matias group=matias jobname=job_gr_PBE
                          queue=batch ctime=1320783969 qtime=1320783969
                          etime=1320783969 start=1320791205 owner=matias@fe
                          exec_host=n11/7+n11/6+n11/5+n11/4
                          Resource_List.neednodes=1:ppn=4 Resource_List.nodect=1
                          Resource_List.nodes=1:ppn=4
                          Resource_List.walltime=2400:00:00
11/08/2011 19:26:53  S    Not sending email: User does not want mail of this
                          type.
11/08/2011 19:26:53  S    Exit_status=0 resources_used.cput=00:00:27
                          resources_used.mem=0kb resources_used.vmem=0kb
                          resources_used.walltime=00:00:09
11/08/2011 19:26:53  A    user=matias group=matias jobname=job_gr_PBE
                          queue=batch ctime=1320783969 qtime=1320783969
                          etime=1320783969 start=1320791205 owner=matias@fe
                          exec_host=n11/7+n11/6+n11/5+n11/4
                          Resource_List.neednodes=1:ppn=4 Resource_List.nodect=1
                          Resource_List.nodes=1:ppn=4
                          Resource_List.walltime=2400:00:00 session=8035
                          end=1320791213 Exit_status=0
                          resources_used.cput=00:00:27 resources_used.mem=0kb
                          resources_used.vmem=0kb
                          resources_used.walltime=00:00:09
11/08/2011 19:31:53  S    dequeuing from batch, state COMPLETE
[root@fe ~]#
[root@fe ~]# tracejob 477
/var/spool/torque/mom_logs/20111108: No such file or directory
/var/spool/torque/sched_logs/20111108: No such file or directory

Job: 477.fe

11/08/2011 11:58:16  S    enqueuing into batch, state 1 hop 1
11/08/2011 11:58:16  S    Job Queued at request of matias@fe, owner = matias@fe,
                          job name = job_gr_PBE, queue = batch
11/08/2011 11:58:16  A    queue=batch
11/08/2011 11:58:17  S    Job Run at request of root@fe
11/08/2011 11:58:19  S    unable to run job, MOM rejected/rc=2
11/08/2011 12:58:34  S    Job Run at request of root@fe
11/08/2011 12:58:36  S    unable to run job, MOM rejected/rc=2
11/08/2011 13:58:37  S    Job Run at request of root@fe
11/08/2011 13:58:39  S    unable to run job, MOM rejected/rc=2
11/08/2011 14:58:43  S    Job Run at request of root@fe
11/08/2011 14:58:45  S    unable to run job, MOM rejected/rc=2
11/08/2011 15:59:09  S    Job Run at request of root@fe
11/08/2011 15:59:11  S    unable to run job, MOM rejected/rc=2
11/08/2011 16:59:30  S    Job Run at request of root@fe
11/08/2011 16:59:32  S    unable to run job, MOM rejected/rc=2
11/08/2011 17:59:50  S    Job Run at request of root@fe
11/08/2011 17:59:52  S    unable to run job, MOM rejected/rc=2
11/08/2011 19:00:02  S    Job Run at request of root@fe
11/08/2011 19:00:02  S    Not sending email: User does not want mail of this
                          type.
11/08/2011 19:00:02  A    user=matias group=matias jobname=job_gr_PBE
                          queue=batch ctime=1320764296 qtime=1320764296
                          etime=1320764296 start=1320789602 owner=matias@fe
                          exec_host=n11/7+n11/6+n11/5+n11/4
                          Resource_List.neednodes=1:ppn=4 Resource_List.nodect=1
                          Resource_List.nodes=1:ppn=4
                          Resource_List.walltime=2400:00:00
11/08/2011 19:00:10  S    Not sending email: User does not want mail of this
                          type.
11/08/2011 19:00:10  S    Exit_status=0 resources_used.cput=00:00:27
                          resources_used.mem=0kb resources_used.vmem=0kb
                          resources_used.walltime=00:00:09
11/08/2011 19:00:10  A    user=matias group=matias jobname=job_gr_PBE
                          queue=batch ctime=1320764296 qtime=1320764296
                          etime=1320764296 start=1320789602 owner=matias@fe
                          exec_host=n11/7+n11/6+n11/5+n11/4
                          Resource_List.neednodes=1:ppn=4 Resource_List.nodect=1
                          Resource_List.nodes=1:ppn=4
                          Resource_List.walltime=2400:00:00 session=7936
                          end=1320789610 Exit_status=0
                          resources_used.cput=00:00:27 resources_used.mem=0kb
                          resources_used.vmem=0kb
                          resources_used.walltime=00:00:09
11/08/2011 19:05:11  S    dequeuing from batch, state COMPLETE
[root@fe ~]#
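
To see at a glance how often each job was rejected before it finally ran, the tracejob output can be summarized with a short script. This is a rough sketch under stated assumptions: the function name summarize_trace is hypothetical, the timestamps are read as MM/DD/YYYY, and the first "A" record beginning with "user=" is taken to be the accounting start record. The sample text is an excerpt of the job 480 trace above.

```python
import re
from datetime import datetime

# A tracejob line: timestamp, one-letter source (S = server, A = accounting), message.
LINE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+(\S)\s+(.*)$")


def summarize_trace(trace_text):
    """Count 'MOM rejected' attempts and measure the wait until the start record."""
    rejected = 0
    first_seen = None
    started = None
    for line in trace_text.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # continuation lines and blank lines carry no timestamp
        stamp = datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S")
        source, message = match.group(2), match.group(3)
        if first_seen is None:
            first_seen = stamp
        if "MOM rejected" in message:
            rejected += 1
        # Assumption: the first accounting record starting with "user=" marks the start.
        if started is None and source == "A" and message.startswith("user="):
            started = stamp
    wait = None
    if first_seen is not None and started is not None:
        wait = int((started - first_seen).total_seconds())
    return {"rejected": rejected, "wait_seconds": wait}


SAMPLE = """\
11/08/2011 17:26:09  S    enqueuing into batch, state 1 hop 1
11/08/2011 17:26:10  S    Job Run at request of root@fe
11/08/2011 17:26:12  S    unable to run job, MOM rejected/rc=2
11/08/2011 18:26:36  S    Job Run at request of root@fe
11/08/2011 18:26:38  S    unable to run job, MOM rejected/rc=2
11/08/2011 19:26:45  S    Job Run at request of root@fe
11/08/2011 19:26:45  A    user=matias group=matias jobname=job_gr_PBE
"""

summary = summarize_trace(SAMPLE)
print(summary)
```

The wait computed this way for job 480 agrees with the difference between the start and ctime epoch values in its accounting record.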

I don't understand why the logs are not in the directories /var/spool/torque/mom_logs or /var/spool/torque/sched_logs.
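
A quick way to see which per-day log files exist on a given host is sketched below. The function name log_presence is hypothetical, and the list of subdirectories is an assumption based on the paths in the tracejob messages plus the usual server_logs and server_priv/accounting locations. One possible explanation, which nothing in this thread confirms, is that the daemons that write mom_logs and sched_logs simply do not run on fe.

```python
import os

# Assumed locations of the per-day files, relative to the TORQUE home directory.
LOG_DIRS = ("server_logs", "mom_logs", "sched_logs", "server_priv/accounting")


def log_presence(torque_home, day):
    """Report, for each log directory, whether a file named `day` (YYYYMMDD) exists."""
    return {
        name: os.path.isfile(os.path.join(torque_home, name, day))
        for name in LOG_DIRS
    }


if __name__ == "__main__":
    # Path and date taken from the tracejob messages above.
    for name, present in log_presence("/var/spool/torque", "20111108").items():
        print(f"{name}: {'found' if present else 'missing'}")
```

Running the same check on the compute node that executed the job would show whether the mom_logs file for that day exists there instead.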

Regards

       Fernando

----------------------------------------------------

Ing. Fernando Caba
Director General de Telecomunicaciones
Universidad Nacional del Sur
http://www.dgt.uns.edu.ar
Tel/Fax: (54)-291-4595166
Tel: (54)-291-4595101 int. 2050
Avda. Alem 1253, (B8000CPB) Bahía Blanca - Argentina
----------------------------------------------------


On 08/11/2011 07:17 PM, Hector Oliver wrote:
What is the state of the jobs (tracejob #job)?
Do both of them show up in qstat?
Does your configuration allow several jobs to run at the same time?

On Tue, Nov 8, 2011 at 3:58 PM, Fernando Caba <[email protected]> wrote:

    Hi mauiusers, I have a job that is assigned to node10, cores 0 to 3,
    and another job assigned to the same node and to the identical
    cores (0 to 3).
    Does anybody have any idea what is happening? I have torque-3.0.1 and
    maui-3.3.1.
    Thanks


    _______________________________________________
    mauiusers mailing list
    [email protected]
    http://www.supercluster.org/mailman/listinfo/mauiusers

