I did try that, but I am not seeing any obvious segfaults. When it hangs, the last entries in the log are:
02/17 16:58:40 INFO: received service request from host 'headnode.caltech.edu'
02/17 16:58:40 MSURecvPacket(9,BufP,4,NULL,100000,SC)
02/17 16:58:41 ServerProcessRequests()
02/17 16:58:41 INFO: not rolling logs (6196137 < 10000000)
02/17 16:58:41 MResAdjust(NULL,0,0)
02/17 16:58:41 MStatInitializeActiveSysUsage()
02/17 16:58:41 MStatClearUsage([NONE],Active)
02/17 16:58:41 ServerUpdate()
02/17 16:58:41 MSysUpdateTime()
02/17 16:58:41 INFO: starting iteration 24
02/17 16:58:41 MRMGetInfo()
02/17 16:58:41 MClusterClearUsage()
02/17 16:58:41 MRMClusterQuery()
02/17 16:58:41 MPBSClusterQuery(base,RCount,SC)
02/17 16:58:41 ERROR: cannot get node info: Execution server rejected request MSG=connection to mom timed out

After restarting again, the logs show the following when hung:

02/17 17:15:02 INFO: packet sent (4783 bytes of 4783)
02/17 17:15:02 MSUDisconnect(S)
02/17 17:15:04 ServerProcessRequests()
02/17 17:15:04 INFO: not rolling logs (8856651 < 10000000)
02/17 17:15:04 MResAdjust(NULL,0,0)
02/17 17:15:04 MStatInitializeActiveSysUsage()
02/17 17:15:04 MStatClearUsage([NONE],Active)
02/17 17:15:04 ServerUpdate()
02/17 17:15:04 MSysUpdateTime()
02/17 17:15:04 INFO: starting iteration 3
02/17 17:15:04 MRMGetInfo()
02/17 17:15:04 MClusterClearUsage()
02/17 17:15:04 MRMClusterQuery()
02/17 17:15:04 MPBSClusterQuery(base,RCount,SC)
02/17 17:15:04 ERROR: cannot get node info: NULL

On 02/17/2011 11:12 AM, Andrus, Brian Contractor wrote:
> This says maui has stopped.
>
> Here is something I am running into lately and maybe you are affected
> by the same:
>
> Maui starts segfaulting when a job with ppn >1 is submitted.
>
> You can see that maui is segfaulting by starting it in the foreground:
>
> /<path to maui>/maui -d
>
> then submit a job with ppn >1 and see if maui dies.
>
> Brian Andrus
>
> ------------------------------------------------------------------------
> *From:* [email protected] on behalf of Naveed Near-Ansari
> *Sent:* Thu 2/17/2011 9:40 AM
> *To:* [email protected]
> *Subject:* [Mauiusers] maui commands hang, perhaps when scheduling
>
>
> On our cluster we have been having issues with maui hanging. All of the
> commands hang and then time out:
>
> showq
> ERROR: lost connection to server
> ERROR: cannot request service (status)
>
> I do not see anything enlightening in the logs, and torque commands
> still work fine. I am not entirely certain, but it seems to happen when
> it is actually scheduling something that the hang occurs.
>
> Do you have any ideas as to what could cause this?
>
>
> I have copied the tail end of an strace when this occurs:
>
> write(3, "02/17 09:23:42 ServerProcessRequ"..., 39) = 39
> stat("/opt/maui/log/maui.log", {st_mode=S_IFREG|0640, st_size=2023991,
> ...}) = 0
> write(3, "02/17 09:23:42 INFO: not rol"..., 63) = 63
> write(3, "02/17 09:23:42 MResAdjust(NULL,0"..., 36) = 36
> write(3, "02/17 09:23:42 MStatInitializeAc"..., 47) = 47
> write(3, "02/17 09:23:42 MStatClearUsage(["..., 46) = 46
> write(3, "02/17 09:23:42 ServerUpdate()\n", 30) = 30
> write(3, "02/17 09:23:42 MSysUpdateTime()\n", 32) = 32
> write(3, "02/17 09:23:42 INFO: startin"..., 47) = 47
> write(3, "02/17 09:23:42 MRMGetInfo()\n", 28) = 28
> write(3, "02/17 09:23:42 MClusterClearUsag"..., 36) = 36
> write(3, "02/17 09:23:42 MRMClusterQuery()"..., 33) = 33
> write(3, "02/17 09:23:42 MPBSClusterQuery("..., 48) = 48
> write(7, "+2+12+58+4maui+0+0+0", 20) = 20
> poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 1 ([{fd=7,
> revents=POLLIN}])
> fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
> read(7, "+2+1+0+0+63+515+32+13compute-27-"..., 262144) = 128000
> poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 1 ([{fd=7,
> revents=POLLIN}])
> fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
> read(7, "5 #1 SMP Mon Sep 20 07:12:06 EDT"..., 262077) = 128000
> poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 1 ([{fd=7,
> revents=POLLIN}])
> fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
> read(7, "8636kb,ncpus=8,loadave=0.00,netl"..., 261930) = 79914
> write(3, "02/17 09:23:42 __MPBSGetNodeStat"..., 52) = 52
> write(3, "02/17 09:23:42 INFO: PBS nod"..., 79) = 79
> write(3, "02/17 09:23:42 MPBSNodeUpdate(co"..., 72) = 72
> write(3, "02/17 09:23:42 MPBSLoadQueueInfo"..., 56) = 56
> write(7, "+2+12+20+4maui+0+0+0", 20) = 20
> poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 0 (Timeout)
> write(3, "02/17 09:23:51 ERROR: cannot "..., 73) = 73
> write(7, "+2+12+59+4maui", 14) = 14
> rt_sigaction(SIGALRM, {0x1, [], SA_RESTORER, 0x3a910302d0}, {SIG_DFL,
> [], SA_RESTORER, 0x3a910302d0}, 8) = 0
> alarm(9) = 0
> fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
> read(7, "+2+1+0+0+6+2+1+7default+92+212+1"..., 65536) = 696
> fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
> read(7, 0x2aac8a69b5e0, 65536) = ? ERESTARTSYS (To be restarted)
> --- SIGALRM (Alarm clock) @ 0 (0) ---
> read(7,
>
> _______________________________________________
> mauiusers mailing list
> [email protected]
> http://www.supercluster.org/mailman/listinfo/mauiusers
>
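
In case it helps anyone going through a trace like the one quoted above, here is a minimal Python sketch for picking out the poll() calls that timed out, along with the last payload written to the same descriptor. It assumes the strace output has one call per line (with or without the "> " quote prefix). The function and pattern names are my own and are not part of maui, torque, or strace.

```python
import re

# Matches strace lines such as:
#   poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 0 (Timeout)
POLL_TIMEOUT = re.compile(
    r'poll\(\[\{fd=(\d+),.*?\],\s*\d+,\s*(\d+)\)\s*=\s*0 \(Timeout\)'
)

# Matches strace lines such as:
#   write(7, "+2+12+20+4maui+0+0+0", 20) = 20
#   write(3, "02/17 09:23:42 MRMGetInfo()"..., 28) = 28
WRITE_CALL = re.compile(
    r'write\((\d+),\s*(".*?")(?:\.\.\.)?,\s*\d+\)\s*=\s*\d+'
)


def find_poll_timeouts(trace_text):
    """Return (fd, timeout_ms, last_write_payload) for each timed-out poll()."""
    last_write = {}  # fd -> most recent payload written to that fd
    hits = []
    for line in trace_text.splitlines():
        # Drop any email quote prefix ("> ") and surrounding whitespace.
        line = line.lstrip('> ').strip()
        m = WRITE_CALL.search(line)
        if m:
            last_write[int(m.group(1))] = m.group(2)
            continue
        m = POLL_TIMEOUT.search(line)
        if m:
            fd = int(m.group(1))
            hits.append((fd, int(m.group(2)), last_write.get(fd)))
    return hits
```

Run against the quoted trace, this should report a single timeout: the 9000 ms poll on fd 7 that follows the "+2+12+20+4maui+0+0+0" write, which is the point where the ERROR line gets logged nine seconds later.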
