My apologies to those of you who may get this twice; it's unclear to me which list this needs to hit.
We run a small batch farm: one head node that is also a server, along with four other dedicated servers. On random occasions, never close together in time, we see:

[eob_merge@srvBatchHead01 ~]$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
50982.srvbatchhead01      FS3R140152_000   eob_merge              0 Q batch
50983.srvbatchhead01      FS3R140152_001   eob_merge              0 Q batch
50984.srvbatchhead01      FS3R140152_002   eob_merge              0 Q batch
50985.srvbatchhead01      FS3R140152_003   eob_merge              0 Q batch
50986.srvbatchhead01      FS3R140152_004   eob_merge              0 Q batch

Any further submissions also go into the "Q" state. I've been working with our IT department, and today when we hit this we discovered that maui was hung. Simply trying to "stop" maui fails. We ended up sending a SIGTERM to the maui process, at which point it exited. Doing a "start" on maui got us a new process.

The IT staff member who installed maui and torque sent me the following snippet of the PBS_Server log from yesterday into today.
I have edited out duplicate lines for brevity:

03/26/2014 00:54:58;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 00:56:19;0040;PBS_Server;Svr;srvbatchhead01.stoneeagle.com;Scheduler was sent the command time
03/26/2014 00:59:58;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 01:04:58;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 01:06:19;0040;PBS_Server;Svr;srvbatchhead01.stoneeagle.com;Scheduler was sent the command time
03/26/2014 01:09:58;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
-snip-
03/26/2014 14:09:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 14:14:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 14:16:20;0040;PBS_Server;Svr;srvbatchhead01.stoneeagle.com;Scheduler was sent the command time
03/26/2014 14:19:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 14:24:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 14:25:36;0100;PBS_Server;Job;50977.srvbatchhead01.stoneeagle.com;enqueuing into batch, state 1 hop 1
03/26/2014 14:25:36;0002;PBS_Server;Svr;Act;Account file /var/lib/torque/server_priv/accounting/20140326 opened
03/26/2014 14:25:36;0008;PBS_Server;Job;50977.srvbatchhead01.stoneeagle.com;Job Queued at request of [email protected], owner = [email protected], job name = FS3Q140146_000, queue = batch
03/26/2014 14:25:36;0040;PBS_Server;Svr;srvbatchhead01.stoneeagle.com;Scheduler was sent the command new
03/26/2014 14:25:36;0100;PBS_Server;Job;50978.srvbatchhead01.stoneeagle.com;enqueuing into batch, state 1 hop 1
03/26/2014 14:25:36;0008;PBS_Server;Job;50978.srvbatchhead01.stoneeagle.com;Job Queued at request of [email protected], owner = [email protected], job name = FS3Q140146_001, queue = batch
03/26/2014 14:25:36;0100;PBS_Server;Job;50979.srvbatchhead01.stoneeagle.com;enqueuing into batch, state 1 hop 1
03/26/2014 14:25:36;0008;PBS_Server;Job;50979.srvbatchhead01.stoneeagle.com;Job Queued at request of [email protected], owner = [email protected], job name = FS3Q140146_002, queue = batch
03/26/2014 14:25:36;0100;PBS_Server;Job;50980.srvbatchhead01.stoneeagle.com;enqueuing into batch, state 1 hop 1
03/26/2014 14:25:36;0008;PBS_Server;Job;50980.srvbatchhead01.stoneeagle.com;Job Queued at request of [email protected], owner = [email protected], job name = FS3Q140146_003, queue = batch
03/26/2014 14:25:36;0100;PBS_Server;Job;50981.srvbatchhead01.stoneeagle.com;enqueuing into batch, state 1 hop 1
03/26/2014 14:25:36;0008;PBS_Server;Job;50981.srvbatchhead01.stoneeagle.com;Job Queued at request of [email protected], owner = [email protected], job name = FS3Q140146_004, queue = batch
03/26/2014 14:25:36;0040;PBS_Server;Svr;srvbatchhead01.stoneeagle.com;Scheduler was sent the command new
03/26/2014 14:25:36;0008;PBS_Server;Job;50977.srvbatchhead01.stoneeagle.com;Job Run at request of [email protected]
03/26/2014 14:25:37;0008;PBS_Server;Job;50978.srvbatchhead01.stoneeagle.com;Job Run at request of [email protected]
03/26/2014 14:25:37;000d;PBS_Server;Job;50977.srvbatchhead01.stoneeagle.com;Not sending email: User does not want mail of this type.
03/26/2014 14:25:37;0008;PBS_Server;Job;50979.srvbatchhead01.stoneeagle.com;Job Run at request of [email protected]
03/26/2014 14:25:37;000d;PBS_Server;Job;50978.srvbatchhead01.stoneeagle.com;Not sending email: User does not want mail of this type.
03/26/2014 14:25:37;0008;PBS_Server;Job;50980.srvbatchhead01.stoneeagle.com;Job Run at request of [email protected]
03/26/2014 14:25:37;000d;PBS_Server;Job;50979.srvbatchhead01.stoneeagle.com;Not sending email: User does not want mail of this type.
03/26/2014 14:25:37;0008;PBS_Server;Job;50981.srvbatchhead01.stoneeagle.com;Job Run at request of [email protected]
03/26/2014 14:25:37;000d;PBS_Server;Job;50980.srvbatchhead01.stoneeagle.com;Not sending email: User does not want mail of this type.
03/26/2014 14:25:37;000d;PBS_Server;Job;50981.srvbatchhead01.stoneeagle.com;Not sending email: User does not want mail of this type.
03/26/2014 14:29:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 14:30:33;000d;PBS_Server;Job;50981.srvbatchhead01.stoneeagle.com;Not sending email: User does not want mail of this type.
03/26/2014 14:30:33;0010;PBS_Server;Job;50981.srvbatchhead01.stoneeagle.com;Exit_status=0 resources_used.cput=00:00:07 resources_used.mem=38136kb resources_used.vmem=246096kb resources_used.walltime=00:04:56
03/26/2014 14:30:33;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0x173bde0 (substate=50)
03/26/2014 14:30:33;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0x173bde0 (substate=51)
03/26/2014 14:30:35;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0x173bde0 (substate=51)
03/26/2014 14:30:35;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0x173bde0 (substate=53)
03/26/2014 14:30:35;0100;PBS_Server;Job;50981.srvbatchhead01.stoneeagle.com;dequeuing from batch, state COMPLETE
03/26/2014 14:30:35;0040;PBS_Server;Svr;srvbatchhead01.stoneeagle.com;Scheduler was sent the command term
03/26/2014 14:31:27;000d;PBS_Server;Job;50978.srvbatchhead01.stoneeagle.com;Not sending email: User does not want mail of this type.
03/26/2014 14:31:27;0010;PBS_Server;Job;50978.srvbatchhead01.stoneeagle.com;Exit_status=0 resources_used.cput=00:00:09 resources_used.mem=45792kb resources_used.vmem=252520kb resources_used.walltime=00:05:50
03/26/2014 14:31:27;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0x184b830 (substate=50)
03/26/2014 14:31:27;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0x184b830 (substate=51)
03/26/2014 14:31:28;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0x184b830 (substate=51)
03/26/2014 14:31:28;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0x184b830 (substate=53)
03/26/2014 14:31:28;0100;PBS_Server;Job;50978.srvbatchhead01.stoneeagle.com;dequeuing from batch, state COMPLETE
03/26/2014 14:31:28;0040;PBS_Server;Svr;srvbatchhead01.stoneeagle.com;Scheduler was sent the command term
03/26/2014 14:31:30;000d;PBS_Server;Job;50977.srvbatchhead01.stoneeagle.com;Not sending email: User does not want mail of this type.
03/26/2014 14:31:30;0010;PBS_Server;Job;50977.srvbatchhead01.stoneeagle.com;Exit_status=0 resources_used.cput=00:00:15 resources_used.mem=35160kb resources_used.vmem=242300kb resources_used.walltime=00:05:54
03/26/2014 14:31:30;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0x184a5c0 (substate=50)
03/26/2014 14:31:30;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0x184a5c0 (substate=51)
03/26/2014 14:31:30;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0x184a5c0 (substate=51)
03/26/2014 14:31:30;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0x184a5c0 (substate=53)
03/26/2014 14:31:30;0100;PBS_Server;Job;50977.srvbatchhead01.stoneeagle.com;dequeuing from batch, state COMPLETE
03/26/2014 14:31:30;0040;PBS_Server;Svr;srvbatchhead01.stoneeagle.com;Scheduler was sent the command term
03/26/2014 14:31:35;000d;PBS_Server;Job;50980.srvbatchhead01.stoneeagle.com;Not sending email: User does not want mail of this type.
03/26/2014 14:31:35;0010;PBS_Server;Job;50980.srvbatchhead01.stoneeagle.com;Exit_status=0 resources_used.cput=00:00:11 resources_used.mem=40860kb resources_used.vmem=247476kb resources_used.walltime=00:05:58
03/26/2014 14:31:35;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0x17485c0 (substate=50)
03/26/2014 14:31:35;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0x17485c0 (substate=51)
03/26/2014 14:31:35;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0x17485c0 (substate=51)
03/26/2014 14:31:35;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0x17485c0 (substate=53)
03/26/2014 14:31:35;0100;PBS_Server;Job;50980.srvbatchhead01.stoneeagle.com;dequeuing from batch, state COMPLETE
03/26/2014 14:31:35;0040;PBS_Server;Svr;srvbatchhead01.stoneeagle.com;Scheduler was sent the command term
03/26/2014 14:31:39;000d;PBS_Server;Job;50979.srvbatchhead01.stoneeagle.com;Not sending email: User does not want mail of this type.
03/26/2014 14:31:39;0010;PBS_Server;Job;50979.srvbatchhead01.stoneeagle.com;Exit_status=0 resources_used.cput=00:00:12 resources_used.mem=45508kb resources_used.vmem=252128kb resources_used.walltime=00:06:02
03/26/2014 14:31:39;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0x1747350 (substate=50)
03/26/2014 14:31:39;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0x1747350 (substate=51)
03/26/2014 14:31:39;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0x1747350 (substate=51)
03/26/2014 14:31:39;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0x1747350 (substate=53)
03/26/2014 14:31:39;0100;PBS_Server;Job;50979.srvbatchhead01.stoneeagle.com;dequeuing from batch, state COMPLETE
03/26/2014 14:31:39;0040;PBS_Server;Svr;srvbatchhead01.stoneeagle.com;Scheduler was sent the command term
03/26/2014 14:34:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 14:39:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 14:41:39;0040;PBS_Server;Svr;srvbatchhead01.stoneeagle.com;Scheduler was sent the command time
03/26/2014 14:44:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 14:49:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
-snip-
03/26/2014 21:24:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 21:29:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 21:31:39;0040;PBS_Server;Svr;srvbatchhead01.stoneeagle.com;Scheduler was sent the command time
03/26/2014 21:31:39;0080;PBS_Server;Req;dis_request_read;req header bad, dis error 7 (Premature end of message), type=Connect
03/26/2014 21:31:39;0080;PBS_Server;Req;req_reject;Reject reply code=15058(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=Connect, from @
03/26/2014 21:31:39;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
03/26/2014 21:34:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 21:39:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 21:41:39;0040;PBS_Server;Svr;srvbatchhead01.stoneeagle.com;Scheduler was sent the command time
03/26/2014 21:41:39;0080;PBS_Server;Req;dis_request_read;req header bad, dis error 7 (Premature end of message), type=Connect
03/26/2014 21:41:39;0080;PBS_Server;Req;req_reject;Reject reply code=15058(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=Connect, from @
03/26/2014 21:41:39;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
03/26/2014 21:44:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 21:49:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 21:51:39;0040;PBS_Server;Svr;srvbatchhead01.stoneeagle.com;Scheduler was sent the command time
03/26/2014 21:51:39;0080;PBS_Server;Req;dis_request_read;req header bad, dis error 7 (Premature end of message), type=Connect
03/26/2014 21:51:39;0080;PBS_Server;Req;req_reject;Reject reply code=15058(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=Connect, from @
03/26/2014 21:51:39;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
03/26/2014 21:54:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 21:59:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 22:01:39;0040;PBS_Server;Svr;srvbatchhead01.stoneeagle.com;Scheduler was sent the command time
03/26/2014 22:01:39;0080;PBS_Server;Req;dis_request_read;req header bad, dis error 7 (Premature end of message), type=Connect
03/26/2014 22:01:39;0080;PBS_Server;Req;req_reject;Reject reply code=15058(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=Connect, from @
03/26/2014 22:01:39;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
03/26/2014 22:04:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 22:09:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 22:11:39;0040;PBS_Server;Svr;srvbatchhead01.stoneeagle.com;Scheduler was sent the command time
03/26/2014 22:11:39;0080;PBS_Server;Req;dis_request_read;req header bad, dis error 7 (Premature end of message), type=Connect
03/26/2014 22:11:39;0080;PBS_Server;Req;req_reject;Reject reply code=15058(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=Connect, from @
03/26/2014 22:11:39;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
03/26/2014 22:14:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 22:19:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 22:21:39;0040;PBS_Server;Svr;srvbatchhead01.stoneeagle.com;Scheduler was sent the command time
03/26/2014 22:24:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 22:29:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 22:34:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 22:36:41;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 10 to host 3232297818 has timed out after 900 seconds - closing stale connection
03/26/2014 22:39:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 22:41:39;0040;PBS_Server;Svr;srvbatchhead01.stoneeagle.com;Scheduler was sent the command time
03/26/2014 22:44:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 22:49:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 22:54:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 22:56:41;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 10 to host 3232297818 has timed out after 900 seconds - closing stale connection
03/26/2014 22:59:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 23:01:39;0040;PBS_Server;Svr;srvbatchhead01.stoneeagle.com;Scheduler was sent the command time
03/26/2014 23:04:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 23:05:35;0080;PBS_Server;Req;dis_request_read;req header bad, dis error 7 (Premature end of message), type=Connect
03/26/2014 23:05:35;0080;PBS_Server;Req;req_reject;Reject reply code=15058(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=Connect, from @
03/26/2014 23:05:35;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
03/26/2014 23:09:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 23:11:39;0040;PBS_Server;Svr;srvbatchhead01.stoneeagle.com;Scheduler was sent the command time
03/26/2014 23:14:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 23:19:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 23:24:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 23:26:42;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 10 to host 3232297818 has timed out after 900 seconds - closing stale connection
03/26/2014 23:29:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 23:31:39;0040;PBS_Server;Svr;srvbatchhead01.stoneeagle.com;Scheduler was sent the command time
03/26/2014 23:34:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 23:39:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 23:44:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 23:46:46;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::wait_request, connection 10 to host 3232297818 has timed out after 900 seconds - closing stale connection
03/26/2014 23:49:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 23:51:39;0040;PBS_Server;Svr;srvbatchhead01.stoneeagle.com;Scheduler was sent the command time
03/26/2014 23:54:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/26/2014 23:55:39;0080;PBS_Server;Req;dis_request_read;req header bad, dis error 7 (Premature end of message), type=Connect
03/26/2014 23:55:39;0080;PBS_Server;Req;req_reject;Reject reply code=15058(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=Connect, from @
03/26/2014 23:55:39;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
03/26/2014 23:59:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.7, loglevel = 0
03/27/2014 00:01:39;0002;PBS_Server;Svr;Log;Log closed

If this rings any bells for anyone, or if anyone can offer any assistance, it would be truly appreciated.
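In case it helps anyone triage a log like the one above, here is a minimal Python sketch that drops the routine "Torque Server Version" and "Scheduler was sent the command" entries and decodes the numeric host in the timeout messages. The six-field, semicolon-separated layout is inferred from this excerpt only, and reading the host number as a plain 32-bit IPv4 address is an assumption on my part, not something the Torque documentation has confirmed to me. The function names are made up for illustration.

```python
import ipaddress
import re

# Message prefixes treated as routine noise (taken from the excerpt above).
ROUTINE = (
    "Torque Server Version",
    "Scheduler was sent the command",
)

# Matches the numeric host in "connection 10 to host 3232297818 has timed out".
HOST_RE = re.compile(r"to host (\d+)")

# Assumed field order: timestamp;event code;server;object type;object name;message
KEYS = ("time", "code", "server", "obj_type", "obj_name", "message")


def parse_line(line):
    """Split one log line into its six fields; return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != len(KEYS):
        return None
    return dict(zip(KEYS, parts))


def interesting(lines):
    """Yield parsed entries whose message is not one of the routine prefixes."""
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # e.g. a "-snip-" marker or a blank line
        if entry["message"].startswith(ROUTINE):
            continue
        yield entry


def decode_host(message):
    """Return the dotted-quad form of the integer after 'to host', or None.

    Assumption: the integer is an IPv4 address with the first octet in the
    most significant byte.
    """
    match = HOST_RE.search(message)
    if match is None:
        return None
    return str(ipaddress.ip_address(int(match.group(1))))
```

Feeding the log lines through `interesting()` leaves only the job, DIS and LOG_ERROR entries. Under the byte-order assumption stated above, `decode_host()` turns host 3232297818 into 192.168.243.90, which may help identify which machine's connection is going stale.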
Kind regards,

Jack Wilkinson, Programmer Services | VPay(r)
P: 972.367-6622
[email protected]
www.stoneeagle.com
www.vpayusa.com
111 W. Spring Valley Rd., #100
Richardson, TX 75081
