Hi all,

I'm having a problem with a newly installed Torque/Maui system: many jobs fail to run. They get assigned to a node, but then go into the Waiting state in Torque.
I searched through the Maui, Torque server, and MOM logs and found the following (this is one of many failing jobs):

showq:

BLOCKED JOBS----------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME

1159111 samgrid Hold 1 3:00:00:00 Wed Feb 3 22:29:25

[r...@torque ~]# tracejob 1159111
/var/spool/torque/mom_logs/20100203: No such file or directory
/var/spool/torque/sched_logs/20100203: No such file or directory

Job: 1159111.torque.farm.particle.cz

02/03/2010 22:29:25 S enqueuing into d0prod, state 1 hop 1
02/03/2010 22:29:25 S Job Queued at request of [email protected], owner = [email protected], job name = Z063015370, queue = d0prod
02/03/2010 22:29:25 A queue=d0prod
02/03/2010 22:30:00 S Job Modified at request of [email protected]
02/03/2010 22:30:00 S Job Run at request of [email protected]
02/03/2010 22:30:00 S Job Modified at request of [email protected]
02/03/2010 22:30:00 S post_modify_req: PBSE_UNKJOBID for job 1159111.torque.farm.particle.cz in state RUNNING-STAGEGO, dest = salix37

[r...@torque ~]# grep 1159111 /usr/local/maui/log/maui.log
02/03 22:57:30 MJobFind('1159111',J,0)
02/03 22:57:30 MRMJobPreUpdate(1159111)
02/03 22:57:30 MPBSJobUpdate(1159111,1159111.torque.farm.particle.cz ,TaskList,0)
02/03 22:57:30 __MPBSGetTaskList(1159111,1,TaskList,0)
02/03 22:57:30 INFO: job 1159111 starttime: 1265232592 (00:27:31) presenttime: 1265234243 wclimit: 259200 mtime: 1265232600 etime: 0 walltime: 0 state: Hold
02/03 22:57:30 MRMJobPostUpdate(1159111,TaskList,Hold,base)
02/03 22:57:30 INFO: job '1159111' Priority: 1
02/03 22:57:30 INFO: job '1159111' priority: 1.00
02/03 22:57:31 INFO: job '1159111' Priority: 1
02/03 22:57:31 INFO: job '1159111' priority: 1.00
02/03 22:58:02 INFO: line: ' 1159111 samgrid 1265232592 1265232565 1 259200 - 6 1
02/03 22:58:39 MJobFind('1159111',J,0)
02/03 22:58:39 MRMJobPreUpdate(1159111)
02/03 22:58:39 MPBSJobUpdate(1159111,1159111.torque.farm.particle.cz ,TaskList,0)
02/03 22:58:39 __MPBSGetTaskList(1159111,1,TaskList,0)
02/03 22:58:39 INFO: job 1159111 starttime: 1265232592 (00:28:40) presenttime: 1265234312 wclimit: 259200 mtime: 1265232600 etime: 0 walltime: 0 state: Hold
02/03 22:58:39 MRMJobPostUpdate(1159111,TaskList,Hold,base)
02/03 22:58:40 INFO: job '1159111' Priority: 1
02/03 22:58:40 INFO: job '1159111' priority: 1.00
02/03 22:58:40 INFO: job '1159111' Priority: 1
02/03 22:58:40 INFO: job '1159111' priority: 1.00

[r...@torque ~]# ssh salix37 "grep 1159111 /var/spool/torque/mom_logs/*"
/var/spool/torque/mom_logs/20100203:02/03/2010 22:30:00;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST= salix37.farm.particle.cz MSG=modify job failed, unknown job 1159111.torque.farm.particle.cz), aux=0, type=ModifyJob, from [email protected]

I think the problem is somehow connected with the PBSE_UNKJOBID error, but I haven't found a solution. It seems strange to me that pbs_mom is staging in files but doesn't know the job...

Thank you for any help.

Best regards,
Jan Svec
Institute of Physics AS CR
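
P.S. In case it helps anyone reading the maui.log excerpt above, here is a small Python sketch that decodes the epoch fields of one of the "INFO: job ... starttime:" lines. It is only a reading aid, not part of Torque or Maui. The helper name parse_info and the CET (UTC+1) offset are my own assumptions; the offset is not stated anywhere in the logs, it is just what makes the decoded times line up with the tracejob timestamps.

```python
import re
from datetime import datetime, timedelta, timezone

# One INFO line copied verbatim from the maui.log excerpt above.
LINE = ("02/03 22:57:30 INFO: job 1159111 starttime: 1265232592 (00:27:31) "
        "presenttime: 1265234243 wclimit: 259200 mtime: 1265232600 "
        "etime: 0 walltime: 0 state: Hold")

# Assumption: the cluster clock runs on CET (UTC+1); the logs do not say so.
CET = timezone(timedelta(hours=1))


def parse_info(line):
    """Hypothetical helper: collect the 'name: <integer>' fields of the line."""
    return {key: int(value) for key, value in re.findall(r"(\w+): (\d+)\b", line)}


fields = parse_info(LINE)

# Seconds between Maui's recorded start time and the moment it wrote the line.
elapsed = timedelta(seconds=fields["presenttime"] - fields["starttime"])

print("start   :", datetime.fromtimestamp(fields["starttime"], CET))
print("mtime   :", datetime.fromtimestamp(fields["mtime"], CET))
print("elapsed :", elapsed)
print("wclimit :", timedelta(seconds=fields["wclimit"]))
print("walltime:", fields["walltime"])
```

Under that timezone assumption, mtime decodes to the same second as the "Job Run" and PBSE_UNKJOBID entries in the tracejob output, and the elapsed value agrees with the figure Maui prints in parentheses, while walltime stays at 0.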
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
