Hello everyone, I have a strange problem with jobs that want to use more than one compute node. I tried a simple test script like:
#PBS -l walltime=00:20:00
#PBS -l nodes=2:ppn=1
#PBS -N testjob5

(PS 1 at the end of this mail sketches the complete script.)

It stays in the queue with status Q for eternity. qstat -f shows an increasing number in start_count and exit_status = -3 (see PS 2 at the end for the exact check). I found out that the scheduler (Maui) had already assigned this job to two nodes. I set $logevent=255 and $loglevel=7 for these two nodes (node2 and node3) and found the relevant parts, which you can find below (PS 3 and PS 4 at the end collect a few related command sketches). Jobs with #PBS -l nodes=1:ppn=4 start normally on 4 cores of one node.

Please, I would really welcome any help you can give. Thank you in advance.

Stephan

Log of Node2
------------
12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;ready to commit job
12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;ready to commit job completed
12/11/2009 14:03:53;0080; pbs_mom;Req;dis_request_read;decoding command Commit from PBS_Server
12/11/2009 14:03:53;0008; pbs_mom;Job;process_request;request type Commit from host .blabla.cluster received
12/11/2009 14:03:53;0008; pbs_mom;Job;process_request;request type Commit from host .blabla.cluster allowed
12/11/2009 14:03:53;0008; pbs_mom;Job;dispatch_request;dispatching request Commit on sd=10
12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;committing job
12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;starting job execution
12/11/2009 14:03:53;0001; pbs_mom;Job;job_nodes;0: .blabla-2/0
12/11/2009 14:03:53;0001; pbs_mom;Job;job_nodes;1: .blabla-3/0
12/11/2009 14:03:53;0001; pbs_mom;Job;job_nodes;job: 78.xxx numnodes=2 numvnod=2
12/11/2009 14:03:53;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::init_groups, pre-sigprocmask
12/11/2009 14:03:53;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::init_groups, post-initgroups
12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;job 78.xxx reported successful start on 2 node(s)
12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;job execution started
12/11/2009 14:03:53;0080; pbs_mom;Req;dis_request_read;decoding command Disconnect from PBS_Server
12/11/2009 14:03:53;0080; pbs_mom;Req;dis_request_read;decoding command StatusJob from PBS_Server
12/11/2009 14:03:53;0008; pbs_mom;Job;process_request;request type StatusJob from host .blabla.cluster received
12/11/2009 14:03:53;0008; pbs_mom;Job;process_request;request type StatusJob from host .blabla.cluster allowed
12/11/2009 14:03:53;0008; pbs_mom;Job;dispatch_request;dispatching request StatusJob on sd=10
12/11/2009 14:03:53;0080; pbs_mom;Req;dis_request_read;decoding command ModifyJob from PBS_Server
12/11/2009 14:03:53;0008; pbs_mom;Job;process_request;request type ModifyJob from host .blabla.cluster received
12/11/2009 14:03:53;0008; pbs_mom;Job;process_request;request type ModifyJob from host .blabla.cluster allowed
12/11/2009 14:03:53;0008; pbs_mom;Job;dispatch_request;dispatching request ModifyJob on sd=14
12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;modifying job
12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;modifying type 6 attribute session_id of job (value: '???')
12/11/2009 14:03:53;0002; pbs_mom;n/a;mom_set_limits;mom_set_limits(78.xxx,alter) entered
12/11/2009 14:03:53;0002; pbs_mom;n/a;mom_set_limits;setting limit for attribute 'ncpus'
12/11/2009 14:03:53;0002; pbs_mom;n/a;mom_set_limits;setting limit for attribute 'neednodes'
12/11/2009 14:03:53;0002; pbs_mom;n/a;mom_set_limits;setting limit for attribute 'nodes'
12/11/2009 14:03:53;0002; pbs_mom;n/a;mom_set_limits;setting limit for attribute 'walltime'
12/11/2009 14:03:53;0002; pbs_mom;n/a;mom_set_limits;mom_set_limits(78.xxx,alter) completed
12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;Job Modified at request of [email protected]
12/11/2009 14:03:53;0080; pbs_mom;Req;dis_request_read;decoding command Disconnect from PBS_Server
12/11/2009 14:03:53;0080; pbs_mom;Req;dis_request_read;decoding command Disconnect from PBS_Server
12/11/2009 14:03:53;0008; pbs_mom;Job;do_rpp;got an internal task manager request in do_rpp
12/11/2009 14:03:53;0002; pbs_mom;Svr;im_request;connect from 192.168.1.63:15003
12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;received request 'ERROR' for job 78.xxx from 192.168.1.63:15003
12/11/2009 14:03:53;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad UID for job execution (15023) in 78.xxx, job_start_error from node 192.168.1.63:15003 in job_start_error
12/11/2009 14:03:53;0008; pbs_mom;Req;send_sisters;sending command ABORT_JOB for job 78.xxx (10)
12/11/2009 14:03:53;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters
12/11/2009 14:03:53;0080; pbs_mom;Svr;scan_for_exiting;searching for exiting jobs
12/11/2009 14:03:53;0008; pbs_mom;Job;kill_job;scan_for_exiting: sending signal 9, "KILL" to job 78.xxx, reason: local task termination detected
12/11/2009 14:03:53;0002; pbs_mom;n/a;run_pelog;userepilog script '/var/spool/torque/mom_priv/epilogue.precancel' for job 78.xxx does not exist (cwd: /var/spool/torque/mom_priv,pid: 12854)
12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;kill_job done (killed 0 processes)
12/11/2009 14:03:53;0080; pbs_mom;Job;78.xxx;sending preobit jobstat
12/11/2009 14:03:53;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
12/11/2009 14:03:53;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
12/11/2009 14:03:53;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
12/11/2009 14:03:53;0080; pbs_mom;Job;78.xxx;performing job clean-up
12/11/2009 14:03:53;0080; pbs_mom;Job;78.xxx;epilog subtask created with pid 12858 - substate set to JOB_SUBSTATE_OBIT - registered post_epilogue
12/11/2009 14:03:53;0002; pbs_mom;n/a;mom_close_poll;entered
12/11/2009 14:03:53;0008; pbs_mom;Job;scan_for_terminated;entered
12/11/2009 14:03:53;0080; pbs_mom;Svr;mom_get_sample;proc_array load started
12/11/2009 14:03:53;0080; pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=194
12/11/2009 14:03:53;0080; pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 78.xxx
12/11/2009 14:03:53;0080; pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 78.xxx
12/11/2009 14:03:53;0080; pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 78.xxx
12/11/2009 14:03:53;0080; pbs_mom;Job;78.xxx;checking job w/subtask pid=12858 (child pid=12858)
12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;checking job post-processing routine
12/11/2009 14:03:53;0080; pbs_mom;Req;post_epilogue;preparing obit message for job 78.xxx
12/11/2009 14:03:53;0080; pbs_mom;Job;78.xxx;encoding "send flagged" attr: Error_Path
12/11/2009 14:03:53;0080; pbs_mom;Job;78.xxx;obit sent to server
12/11/2009 14:03:53;0001; pbs_mom;Job;78.xxx;setting job substate to EXITED
12/11/2009 14:03:53;0080; pbs_mom;Req;dis_request_read;decoding command DeleteJob from PBS_Server
12/11/2009 14:03:53;0008; pbs_mom;Job;process_request;request type DeleteJob from host .blabla.cluster received
12/11/2009 14:03:53;0008; pbs_mom;Job;process_request;request type DeleteJob from host .blabla.cluster allowed
12/11/2009 14:03:53;0008; pbs_mom;Job;dispatch_request;dispatching request DeleteJob on sd=10
12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;deleting job
12/11/2009 14:03:53;0080; pbs_mom;Job;78.xxx;deleting job 78.xxx in state EXITED
12/11/2009 14:03:53;0080; pbs_mom;Job;78.xxx;removing job
12/11/2009 14:03:53;0080; pbs_mom;Job;78.xxx;removed job script
12/11/2009 14:03:53;0080; pbs_mom;Job;78.xxx;removed job file
12/11/2009 14:03:53;0080; pbs_mom;Req;dis_request_read;decoding command Disconnect from PBS_Server

Log of Node3
------------
12/11/2009 14:03:40;0008; pbs_mom;Job;do_rpp;got an internal task manager request in do_rpp
12/11/2009 14:03:40;0002; pbs_mom;Svr;im_request;connect from 192.168.1.62:1022
12/11/2009 14:03:40;0008; pbs_mom;Job;78.xxx;received request 'JOIN_JOB' for job 78.xxx from 192.168.1.62:1022
12/11/2009 14:03:40;0008; pbs_mom;Job;78.xxx;im_request: JOIN_JOB 78.xxx node 1
12/11/2009 14:03:40;0001; pbs_mom;Job;job_nodes;0: .blabla-2/0
12/11/2009 14:03:40;0001; pbs_mom;Job;job_nodes;1: .blabla-3/0
12/11/2009 14:03:40;0001; pbs_mom;Job;job_nodes;job: 78.xxx numnodes=2 numvnod=2
12/11/2009 14:03:40;0008; pbs_mom;Job;78.xxx;no group entry for group admin, user=raub, errno=0 (Success)
12/11/2009 14:03:40;0080; pbs_mom;Job;78.xxx;removing job
12/11/2009 14:03:40;0008; pbs_mom;Job;do_rpp;got an internal task manager request in do_rpp
12/11/2009 14:03:40;0002; pbs_mom;Svr;im_request;connect from 192.168.1.62:1022
12/11/2009 14:03:40;0008; pbs_mom;Job;78.xxx;received request 'ABORT_JOB' for job 78.xxx from 192.168.1.62:1022
12/11/2009 14:03:40;0008; pbs_mom;Job;78.xxx;ERROR: received request 'ABORT_JOB' from 192.168.1.62:1022 for job '78.xxx' (job does not exist locally)

Output of qmgr 'p s'
--------------------
create queue rhel
set queue rhel queue_type = Execution
set queue rhel from_route_only = True
set queue rhel resources_max.opsys = RHEL
set queue rhel resources_max.walltime = 360:00:00
set queue rhel resources_min.opsys = RHEL
set queue rhel enabled = True
set queue rhel started = True
create queue default
set queue default queue_type = Route
set queue default resources_default.opsys = SL
set queue default route_destinations = SciLinux
set queue default route_destinations += rhel
set queue default enabled = True
set queue default started = True
set server acl_hosts = xxx
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server resources_default.ncpus = 1
set server resources_default.nodes = 1
set server resources_default.opsys = SL
set server resources_default.walltime = 01:00:00
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 79

Maui.cfg
--------
SERVERHOST xxx
ADMIN1 root
ADMINHOST localhost
RMTYPE[0] PBS
RMHOST[0] localhost
RMSERVER[0] localhost
RMPOLLINTERVAL 00:00:10
SERVERPORT 40559
SERVERMODE NORMAL
LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3
ENFORCERESOURCELIMITS ON
QUEUETIMEWEIGHT 1
BACKFILLDEPTH 0
BACKFILLMETRIC PROCS
BACKFILLPOLICY BESTFIT
QOSCFG[monopol] QFLAGS=DEDICATED
CLASSWEIGHT 10
CLASSCFG[cuda] QLIST=monopol QDEF=monopol

--
---------------------------------------------------------
| Dr. rer. nat. Stephan Raub
| Dipl. Chem.
| Lehrstuhl für IT-Management / ZIM
| Heinrich-Heine-Universität Düsseldorf
| Universitätsstr. 1 / 25.41.O2.25-2
| 40225 Düsseldorf / Germany
|
| Tel: +49-211-811-3911
---------------------------------------------------------
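PS 1: For completeness, the test script is really just the three directives quoted at the top plus a trivial payload; a minimal sketch of what I submit with a plain qsub call (the body here is only a placeholder, not my real workload):

#!/bin/bash
#PBS -l walltime=00:20:00
#PBS -l nodes=2:ppn=1
#PBS -N testjob5

# placeholder payload: show the allocated nodes and the host we land on
cat "$PBS_NODEFILE"
hostname
sleep 60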
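PS 2: The start_count and exit_status values come straight out of qstat -f; this is roughly how I watch the retry counter:

qstat -f 78 | grep -E 'start_count|exit_status'

start_count keeps climbing while the job sits in state Q.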
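PS 3: Node2 logs "Bad UID for job execution (15023)" coming back from node3, and node3 itself logs "no group entry for group admin, user=raub". So a check whether both nodes agree on the user and group databases might be relevant, for example (node2/node3 standing in for the real hostnames):

ssh node2 'id raub; getent group admin'
ssh node3 'id raub; getent group admin'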
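PS 4: If it helps, I can also post Maui's view of the job, e.g. the output of:

checkjob -v 78
showq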
