Dear Dr. Stephan Raub,

You can check the values of the Maui configuration with the command

    showconfig | more
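Since showconfig prints a long list, it may help to filter it for the multi-node parameters named in this reply. The following is only a sketch: multinode_params is a made-up helper name, and the sample input is invented for the demonstration and is not real showconfig output.

```shell
#!/bin/sh
# Hypothetical helper (not a Maui command): keeps only the lines that
# mention the three multi-node parameters named in this reply.
multinode_params() {
    grep -E 'ENABLEMULTINODEJOBS|JOBNODEMATCHPOLICY|NODEACCESSPOLICY'
}

# On the Maui server host one would run:
#   showconfig | multinode_params
# Demonstration with invented sample input instead of real output:
printf 'LOGLEVEL 3\nNODEACCESSPOLICY SHARED\n' | multinode_params
```

A parameter that does not show up in the filtered output is a candidate for being set explicitly.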
If the following parameters are not already set, you can add them:

    ENABLEMULTINODEJOBS   TRUE
    JOBNODEMATCHPOLICY    EXACTNODE
    NODEACCESSPOLICY      SHARED

With JOBNODEMATCHPOLICY EXACTNODE in particular, you can submit parallel jobs with

    qsub -l nodes=4:ppn=1    (1 core on each of 4 nodes)

or

    qsub -l nodes=1:ppn=4    (4 cores on 1 node)

Also, with the command checkjob <jobid> you can get a lot of information about the job status and resources.

Best,
E.V.

--
University of Ioannina
Department of Computer Science
P.O. Box 1186 Ioannina, Greece
Tel: (+30)-26510-98864
Fax: (+30)-26510-98890

Quoting "Dr. Stephan Raub" <[email protected]>:

> Hello everyone
>
> I have a strange problem with jobs, which want to use more than one
> computenode. I tried a simple test-skript like:
>
>
> #PBS -l walltime=00:20:00
> #PBS -l nodes=2:ppn=1
> #PBS -N testjob5
>
> It stays in the queue with status "Q" for eternity. qstat -f shows an
> increasing number in "start_count" and "exit_status = -3". I found out, that
> the scheduler (maui) already assigned this job to two nodes. I set
> $logevent=255 and $loglevel=7 for this two nodes (node2 and node3) and found
> the relevant parts, which you can find below.
>
> Jobs with #PBS -l nodes=1:ppn=4 start normally on 4 Cores of one node.
>
> Please, I would really welcome any help you can give.
>
> Thank You in advance.
>
> Stephan
>
>
> Log of Node2
> ------------
> 12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;ready to commit job
> 12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;ready to commit job completed
> 12/11/2009 14:03:53;0080; pbs_mom;Req;dis_request_read;decoding command Commit from PBS_Server
> 12/11/2009 14:03:53;0008; pbs_mom;Job;process_request;request type Commit from host .blabla.cluster received
> 12/11/2009 14:03:53;0008; pbs_mom;Job;process_request;request type Commit from host .blabla.cluster allowed
> 12/11/2009 14:03:53;0008; pbs_mom;Job;dispatch_request;dispatching request Commit on sd=10
> 12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;committing job
> 12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;starting job execution
> 12/11/2009 14:03:53;0001; pbs_mom;Job;job_nodes;0: .blabla-2/0
> 12/11/2009 14:03:53;0001; pbs_mom;Job;job_nodes;1: .blabla-3/0
> 12/11/2009 14:03:53;0001; pbs_mom;Job;job_nodes;job: 78.xxx numnodes=2 numvnod=2
> 12/11/2009 14:03:53;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::init_groups, pre-sigprocmask
> 12/11/2009 14:03:53;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::init_groups, post-initgroups
> 12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;job 78.xxx reported successful start on 2 node(s)
> 12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;job execution started
> 12/11/2009 14:03:53;0080; pbs_mom;Req;dis_request_read;decoding command Disconnect from PBS_Server
> 12/11/2009 14:03:53;0080; pbs_mom;Req;dis_request_read;decoding command StatusJob from PBS_Server
> 12/11/2009 14:03:53;0008; pbs_mom;Job;process_request;request type StatusJob from host .blabla.cluster received
> 12/11/2009 14:03:53;0008; pbs_mom;Job;process_request;request type StatusJob from host .blabla.cluster allowed
> 12/11/2009 14:03:53;0008; pbs_mom;Job;dispatch_request;dispatching request StatusJob on sd=10
> 12/11/2009 14:03:53;0080; pbs_mom;Req;dis_request_read;decoding command ModifyJob from PBS_Server
> 12/11/2009 14:03:53;0008; pbs_mom;Job;process_request;request type ModifyJob from host .blabla.cluster received
> 12/11/2009 14:03:53;0008; pbs_mom;Job;process_request;request type ModifyJob from host .blabla.cluster allowed
> 12/11/2009 14:03:53;0008; pbs_mom;Job;dispatch_request;dispatching request ModifyJob on sd=14
> 12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;modifying job
> 12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;modifying type 6 attribute session_id of job (value: '???')
> 12/11/2009 14:03:53;0002; pbs_mom;n/a;mom_set_limits;mom_set_limits(78.xxx,alter) entered
> 12/11/2009 14:03:53;0002; pbs_mom;n/a;mom_set_limits;setting limit for attribute 'ncpus'
> 12/11/2009 14:03:53;0002; pbs_mom;n/a;mom_set_limits;setting limit for attribute 'neednodes'
> 12/11/2009 14:03:53;0002; pbs_mom;n/a;mom_set_limits;setting limit for attribute 'nodes'
> 12/11/2009 14:03:53;0002; pbs_mom;n/a;mom_set_limits;setting limit for attribute 'walltime'
> 12/11/2009 14:03:53;0002; pbs_mom;n/a;mom_set_limits;mom_set_limits(78.xxx,alter) completed
> 12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;Job Modified at request of [email protected]
> 12/11/2009 14:03:53;0080; pbs_mom;Req;dis_request_read;decoding command Disconnect from PBS_Server
> 12/11/2009 14:03:53;0080; pbs_mom;Req;dis_request_read;decoding command Disconnect from PBS_Server
> 12/11/2009 14:03:53;0008; pbs_mom;Job;do_rpp;got an internal task manager request in do_rpp
> 12/11/2009 14:03:53;0002; pbs_mom;Svr;im_request;connect from 192.168.1.63:15003
> 12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;received request 'ERROR' for job 78.xxx from 192.168.1.63:15003
> 12/11/2009 14:03:53;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad UID for job execution (15023) in 78.xxx, job_start_error from node 192.168.1.63:15003 in job_start_error
> 12/11/2009 14:03:53;0008; pbs_mom;Req;send_sisters;sending command ABORT_JOB for job 78.xxx (10)
> 12/11/2009 14:03:53;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters
> 12/11/2009 14:03:53;0080; pbs_mom;Svr;scan_for_exiting;searching for exiting jobs
> 12/11/2009 14:03:53;0008; pbs_mom;Job;kill_job;scan_for_exiting: sending signal 9, "KILL" to job 78.xxx, reason: local task termination detected
> 12/11/2009 14:03:53;0002; pbs_mom;n/a;run_pelog;userepilog script '/var/spool/torque/mom_priv/epilogue.precancel' for job 78.xxx does not exist (cwd: /var/spool/torque/mom_priv,pid: 12854)
> 12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;kill_job done (killed 0 processes)
> 12/11/2009 14:03:53;0080; pbs_mom;Job;78.xxx;sending preobit jobstat
> 12/11/2009 14:03:53;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 12/11/2009 14:03:53;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 12/11/2009 14:03:53;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 12/11/2009 14:03:53;0080; pbs_mom;Job;78.xxx;performing job clean-up
> 12/11/2009 14:03:53;0080; pbs_mom;Job;78.xxx;epilog subtask created with pid 12858 - substate set to JOB_SUBSTATE_OBIT - registered post_epilogue
> 12/11/2009 14:03:53;0002; pbs_mom;n/a;mom_close_poll;entered
> 12/11/2009 14:03:53;0008; pbs_mom;Job;scan_for_terminated;entered
> 12/11/2009 14:03:53;0080; pbs_mom;Svr;mom_get_sample;proc_array load started
> 12/11/2009 14:03:53;0080; pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=194
> 12/11/2009 14:03:53;0080; pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 78.xxx
> 12/11/2009 14:03:53;0080; pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 78.xxx
> 12/11/2009 14:03:53;0080; pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 78.xxx
> 12/11/2009 14:03:53;0080; pbs_mom;Job;78.xxx;checking job w/subtask pid=12858 (child pid=12858)
> 12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;checking job post-processing routine
> 12/11/2009 14:03:53;0080; pbs_mom;Req;post_epilogue;preparing obit message for job 78.xxx
> 12/11/2009 14:03:53;0080; pbs_mom;Job;78.xxx;encoding "send flagged" attr: Error_Path
> 12/11/2009 14:03:53;0080; pbs_mom;Job;78.xxx;obit sent to server
> 12/11/2009 14:03:53;0001; pbs_mom;Job;78.xxx;setting job substate to EXITED
> 12/11/2009 14:03:53;0080; pbs_mom;Req;dis_request_read;decoding command DeleteJob from PBS_Server
> 12/11/2009 14:03:53;0008; pbs_mom;Job;process_request;request type DeleteJob from host .blabla.cluster received
> 12/11/2009 14:03:53;0008; pbs_mom;Job;process_request;request type DeleteJob from host .blabla.cluster allowed
> 12/11/2009 14:03:53;0008; pbs_mom;Job;dispatch_request;dispatching request DeleteJob on sd=10
> 12/11/2009 14:03:53;0008; pbs_mom;Job;78.xxx;deleting job
> 12/11/2009 14:03:53;0080; pbs_mom;Job;78.xxx;deleting job 78.xxx in state EXITED
> 12/11/2009 14:03:53;0080; pbs_mom;Job;78.xxx;removing job
> 12/11/2009 14:03:53;0080; pbs_mom;Job;78.xxx;removed job script
> 12/11/2009 14:03:53;0080; pbs_mom;Job;78.xxx;removed job file
> 12/11/2009 14:03:53;0080; pbs_mom;Req;dis_request_read;decoding command Disconnect from PBS_Server
>
>
>
> Log of Node3
> ------------
> 12/11/2009 14:03:40;0008; pbs_mom;Job;do_rpp;got an internal task manager request in do_rpp
> 12/11/2009 14:03:40;0002; pbs_mom;Svr;im_request;connect from 192.168.1.62:1022
> 12/11/2009 14:03:40;0008; pbs_mom;Job;78.xxx;received request 'JOIN_JOB' for job 78.xxx from 192.168.1.62:1022
> 12/11/2009 14:03:40;0008; pbs_mom;Job;78.xxx;im_request: JOIN_JOB 78.xxx node 1
> 12/11/2009 14:03:40;0001; pbs_mom;Job;job_nodes;0: .blabla-2/0
> 12/11/2009 14:03:40;0001; pbs_mom;Job;job_nodes;1: .blabla-3/0
> 12/11/2009 14:03:40;0001; pbs_mom;Job;job_nodes;job: 78.xxx numnodes=2 numvnod=2
> 12/11/2009 14:03:40;0008; pbs_mom;Job;78.xxx;no group entry for group admin, user=raub, errno=0 (Success)
> 12/11/2009 14:03:40;0080; pbs_mom;Job;78.xxx;removing job
> 12/11/2009 14:03:40;0008; pbs_mom;Job;do_rpp;got an internal task manager request in do_rpp
> 12/11/2009 14:03:40;0002; pbs_mom;Svr;im_request;connect from 192.168.1.62:1022
> 12/11/2009 14:03:40;0008; pbs_mom;Job;78.xxx;received request 'ABORT_JOB' for job 78.xxx from 192.168.1.62:1022
> 12/11/2009 14:03:40;0008; pbs_mom;Job;78.xxx;ERROR: received request 'ABORT_JOB' from 192.168.1.62:1022 for job '78.xxx' (job does not exist locally)
>
>
> Output of qmgr "ps"
> -------------------
> create queue rhel
> set queue rhel queue_type = Execution
> set queue rhel from_route_only = True
> set queue rhel resources_max.opsys = RHEL
> set queue rhel resources_max.walltime = 360:00:00
> set queue rhel resources_min.opsys = RHEL
> set queue rhel enabled = True
> set queue rhel started = True
>
> create queue default
> set queue default queue_type = Route
> set queue default resources_default.opsys = SL
> set queue default route_destinations = SciLinux
> set queue default route_destinations += rhel
> set queue default enabled = True
> set queue default started = True
>
> set server acl_hosts = xxx
> set server default_queue = default
> set server log_events = 511
> set server mail_from = adm
> set server resources_default.ncpus = 1
> set server resources_default.nodes = 1
> set server resources_default.opsys = SL
> set server resources_default.walltime = 01:00:00
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server next_job_number = 79
>
>
>
> Maui.cfg
> --------
> SERVERHOST xxx
>
> ADMIN1 root
> ADMINHOST localhost
>
> RMTYPE[0] PBS
> RMHOST[0] localhost
> RMSERVER[0] localhost
>
> RMPOLLINTERVAL 00:00:10
>
> SERVERPORT 40559
> SERVERMODE NORMAL
>
> LOGFILE maui.log
> LOGFILEMAXSIZE 10000000
> LOGLEVEL 3
>
> ENFORCERESOURCELIMITS ON
>
> QUEUETIMEWEIGHT 1
>
> BACKFILLDEPTH 0
> BACKFILLMETRIC PROCS
> BACKFILLPOLICY BESTFIT
>
> QOSCFG[monopol] QFLAGS=DEDICATED
>
> CLASSWEIGHT 10
> CLASSCFG[cuda] QLIST=monopol QDEF=monopol
>
> --
> ---------------------------------------------------------
> | | Dr. rer. nat. Stephan Raub
> | | Dipl. Chem.
> | | Lehrstuhl für IT-Management / ZIM
> | | Heinrich-Heine-Universität Düsseldorf Universitätsstr. 1 /
> | | 25.41.O2.25-2
> | | 40225 Düsseldorf / Germany
> | |
> | | Tel: +49-211-811-3911
> ---------------------------------------------------------
>
>
> _______________________________________________
> mauiusers mailing list
> [email protected]
> http://www.supercluster.org/mailman/listinfo/mauiusers
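As a small illustration of the nodes=N:ppn=P notation used in this thread: the request asks for P cores on each of N nodes, so N*P cores in total. The sketch below only prints the corresponding qsub command lines and does not submit anything; nodes_request is a made-up helper name, not part of TORQUE or Maui.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE or Maui): prints the qsub
# resource request for a node count and a cores-per-node count,
# together with the total number of cores the request asks for.
nodes_request() {
    nodes=$1
    ppn=$2
    echo "qsub -l nodes=${nodes}:ppn=${ppn}  # total cores: $((nodes * ppn))"
}

nodes_request 4 1   # 1 core on each of 4 nodes
nodes_request 1 4   # 4 cores on a single node
nodes_request 2 1   # the request from the original question
```

Only the first and third requests span more than one node, which is the case the multi-node settings in the reply are about.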
