Looks like pbs_mom may not be running on the nodes.
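A rough sketch of how the pbsnodes checks in this reply could be scripted, in case it helps. It is not part of Torque or Maui. The function name check_pbsnodes is made up for illustration, and the parsing assumes the usual `pbsnodes -a` layout: an unindented node name followed by indented "key = value" lines, with job entries shaped like "slot/jobid". A node reported as "down" usually means pbs_server cannot reach pbs_mom there. A job list that does not match the assumed shape (for example, one holding environment variables) gets flagged as suspicious.

```python
import re

# Assumption: a well-formed job entry looks like "<slot>/<jobid>",
# e.g. "0/1234.headnode" (a slot range such as "0-3/..." is also accepted).
JOB_ENTRY = re.compile(r"^\d+(-\d+)?/\S+$")


def check_pbsnodes(text):
    """Scan `pbsnodes -a` style text.

    Returns (down_nodes, corrupt_nodes): nodes whose state includes "down",
    and nodes whose jobs list has entries that do not look like slot/jobid.
    """
    down_nodes, corrupt_nodes = [], []
    node = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank line between node records
        if not line[0].isspace():
            node = line.strip()  # unindented line starts a new node record
            continue
        if node is None:
            continue  # attribute line before any node name; ignore it
        key, _, value = line.strip().partition(" = ")
        if key == "state":
            # The state field may hold several comma-separated values.
            if "down" in [s.strip() for s in value.split(",")]:
                down_nodes.append(node)
        elif key == "jobs":
            entries = [e.strip() for e in value.split(",")]
            if not all(JOB_ENTRY.match(e) for e in entries):
                corrupt_nodes.append(node)
    return down_nodes, corrupt_nodes
```

To try it, save the output of `pbsnodes -a` to a file and pass the file contents to check_pbsnodes. Any node in the second list is worth inspecting by hand before restarting pbs_server.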
Take a look at the output of pbsnodes. See if there is corruption in the job list for the nodes. If so, restart pbs_server and then look again. I have seen cases where the job list contains the output of an 'env' command rather than true jobs. This would cause maui to die as well.

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

From: Naveed Near-Ansari [mailto:[email protected]]
Sent: Thursday, February 17, 2011 5:26 PM
To: Andrus, Brian Contractor
Cc: [email protected]
Subject: Re: [Mauiusers] maui commands hang, perhaps when scheduling

I did try that, but I am not seeing obvious segfaults. When hung, the last things in the logs are:

2/17 16:58:40 INFO: received service request from host 'headnode.caltech.edu'
02/17 16:58:40 MSURecvPacket(9,BufP,4,NULL,100000,SC)
02/17 16:58:41 ServerProcessRequests()
02/17 16:58:41 INFO: not rolling logs (6196137 < 10000000)
02/17 16:58:41 MResAdjust(NULL,0,0)
02/17 16:58:41 MStatInitializeActiveSysUsage()
02/17 16:58:41 MStatClearUsage([NONE],Active)
02/17 16:58:41 ServerUpdate()
02/17 16:58:41 MSysUpdateTime()
02/17 16:58:41 INFO: starting iteration 24
02/17 16:58:41 MRMGetInfo()
02/17 16:58:41 MClusterClearUsage()
02/17 16:58:41 MRMClusterQuery()
02/17 16:58:41 MPBSClusterQuery(base,RCount,SC)
02/17 16:58:41 ERROR: cannot get node info: Execution server rejected request MSG=connection to mom timed out

After restarting again, the logs show the following when hung:

02/17 17:15:02 INFO: packet sent (4783 bytes of 4783)
02/17 17:15:02 MSUDisconnect(S)
02/17 17:15:04 ServerProcessRequests()
02/17 17:15:04 INFO: not rolling logs (8856651 < 10000000)
02/17 17:15:04 MResAdjust(NULL,0,0)
02/17 17:15:04 MStatInitializeActiveSysUsage()
02/17 17:15:04 MStatClearUsage([NONE],Active)
02/17 17:15:04 ServerUpdate()
02/17 17:15:04 MSysUpdateTime()
02/17 17:15:04 INFO: starting iteration 3
02/17 17:15:04 MRMGetInfo()
02/17 17:15:04 MClusterClearUsage()
02/17 17:15:04 MRMClusterQuery()
02/17 17:15:04 MPBSClusterQuery(base,RCount,SC)
02/17 17:15:04 ERROR: cannot get node info: NULL

On 02/17/2011 11:12 AM, Andrus, Brian Contractor wrote:

This says maui has stopped.
Here is something I am running into lately and maybe you are affected by the same: Maui starts segfaulting when a job with ppn >1 is submitted. You can see that maui is segfaulting by starting it in the foreground:

/<path to maui>/maui -d

then submit a job with ppn >1 and see if maui dies.

Brian Andrus

________________________________
From: [email protected] on behalf of Naveed Near-Ansari
Sent: Thu 2/17/2011 9:40 AM
To: [email protected]
Subject: [Mauiusers] maui commands hang, perhaps when scheduling

On our cluster we have been having issues with maui hanging. All of the commands hang and then time out:

showq
ERROR: lost connection to server
ERROR: cannot request service (status)

I do not see anything enlightening in the logs, and torque commands still work fine. I am not entirely certain, but the hang seems to occur when maui is actually scheduling something. Do you have any ideas as to what could cause this?
I have copied the tail end of an strace when this occurs:

write(3, "02/17 09:23:42 ServerProcessRequ"..., 39) = 39
stat("/opt/maui/log/maui.log", {st_mode=S_IFREG|0640, st_size=2023991, ...}) = 0
write(3, "02/17 09:23:42 INFO: not rol"..., 63) = 63
write(3, "02/17 09:23:42 MResAdjust(NULL,0"..., 36) = 36
write(3, "02/17 09:23:42 MStatInitializeAc"..., 47) = 47
write(3, "02/17 09:23:42 MStatClearUsage(["..., 46) = 46
write(3, "02/17 09:23:42 ServerUpdate()\n", 30) = 30
write(3, "02/17 09:23:42 MSysUpdateTime()\n", 32) = 32
write(3, "02/17 09:23:42 INFO: startin"..., 47) = 47
write(3, "02/17 09:23:42 MRMGetInfo()\n", 28) = 28
write(3, "02/17 09:23:42 MClusterClearUsag"..., 36) = 36
write(3, "02/17 09:23:42 MRMClusterQuery()"..., 33) = 33
write(3, "02/17 09:23:42 MPBSClusterQuery("..., 48) = 48
write(7, "+2+12+58+4maui+0+0+0", 20) = 20
poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 1 ([{fd=7, revents=POLLIN}])
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+63+515+32+13compute-27-"..., 262144) = 128000
poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 1 ([{fd=7, revents=POLLIN}])
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, "5 #1 SMP Mon Sep 20 07:12:06 EDT"..., 262077) = 128000
poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 1 ([{fd=7, revents=POLLIN}])
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, "8636kb,ncpus=8,loadave=0.00,netl"..., 261930) = 79914
write(3, "02/17 09:23:42 __MPBSGetNodeStat"..., 52) = 52
write(3, "02/17 09:23:42 INFO: PBS nod"..., 79) = 79
write(3, "02/17 09:23:42 MPBSNodeUpdate(co"..., 72) = 72
write(3, "02/17 09:23:42 MPBSLoadQueueInfo"..., 56) = 56
write(7, "+2+12+20+4maui+0+0+0", 20) = 20
poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 0 (Timeout)
write(3, "02/17 09:23:51 ERROR: cannot "..., 73) = 73
write(7, "+2+12+59+4maui", 14) = 14
rt_sigaction(SIGALRM, {0x1, [], SA_RESTORER, 0x3a910302d0}, {SIG_DFL, [], SA_RESTORER, 0x3a910302d0}, 8) = 0
alarm(9) = 0
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+6+2+1+7default+92+212+1"..., 65536) = 696
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, 0x2aac8a69b5e0, 65536) = ? ERESTARTSYS (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
read(7,

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
