This says maui has stopped.

Here is something I have been running into lately, and you may be affected by the same thing: Maui segfaults when a job with ppn > 1 is submitted.

You can check whether Maui is segfaulting by starting it in the foreground:

    /<path to maui>/maui -d

Then submit a job with ppn > 1 and see whether Maui dies.

Brian Andrus
________________________________
From: [email protected] on behalf of Naveed Near-Ansari
Sent: Thu 2/17/2011 9:40 AM
To: [email protected]
Subject: [Mauiusers] maui commands hang, perhaps when scheduling

On our cluster we have been having issues with Maui hanging. All of the commands hang and then time out:

showq
ERROR: lost connection to server
ERROR: cannot request service (status)

I do not see anything enlightening in the logs, and the Torque commands still work fine. I am not entirely certain, but the hang seems to occur while Maui is actually scheduling something. Do you have any ideas as to what could cause this?

I have copied the tail end of an strace taken when this occurs:

write(3, "02/17 09:23:42 ServerProcessRequ"..., 39) = 39
stat("/opt/maui/log/maui.log", {st_mode=S_IFREG|0640, st_size=2023991, ...}) = 0
write(3, "02/17 09:23:42 INFO: not rol"..., 63) = 63
write(3, "02/17 09:23:42 MResAdjust(NULL,0"..., 36) = 36
write(3, "02/17 09:23:42 MStatInitializeAc"..., 47) = 47
write(3, "02/17 09:23:42 MStatClearUsage(["..., 46) = 46
write(3, "02/17 09:23:42 ServerUpdate()\n", 30) = 30
write(3, "02/17 09:23:42 MSysUpdateTime()\n", 32) = 32
write(3, "02/17 09:23:42 INFO: startin"..., 47) = 47
write(3, "02/17 09:23:42 MRMGetInfo()\n", 28) = 28
write(3, "02/17 09:23:42 MClusterClearUsag"..., 36) = 36
write(3, "02/17 09:23:42 MRMClusterQuery()"..., 33) = 33
write(3, "02/17 09:23:42 MPBSClusterQuery("..., 48) = 48
write(7, "+2+12+58+4maui+0+0+0", 20) = 20
poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 1 ([{fd=7, revents=POLLIN}])
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+63+515+32+13compute-27-"..., 262144) = 128000
poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 1 ([{fd=7, revents=POLLIN}])
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, "5 #1 SMP Mon Sep 20 07:12:06 EDT"..., 262077) = 128000
poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 1 ([{fd=7, revents=POLLIN}])
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, "8636kb,ncpus=8,loadave=0.00,netl"..., 261930) = 79914
write(3, "02/17 09:23:42 __MPBSGetNodeStat"..., 52) = 52
write(3, "02/17 09:23:42 INFO: PBS nod"..., 79) = 79
write(3, "02/17 09:23:42 MPBSNodeUpdate(co"..., 72) = 72
write(3, "02/17 09:23:42 MPBSLoadQueueInfo"..., 56) = 56
write(7, "+2+12+20+4maui+0+0+0", 20) = 20
poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 0 (Timeout)
write(3, "02/17 09:23:51 ERROR: cannot "..., 73) = 73
write(7, "+2+12+59+4maui", 14) = 14
rt_sigaction(SIGALRM, {0x1, [], SA_RESTORER, 0x3a910302d0}, {SIG_DFL, [], SA_RESTORER, 0x3a910302d0}, 8) = 0
alarm(9) = 0
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+6+2+1+7default+92+212+1"..., 65536) = 696
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, 0x2aac8a69b5e0, 65536) = ? ERESTARTSYS (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
read(7,

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
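For anyone reading a longer strace of the same kind, a small script can pick out the poll() calls that timed out, such as the 9000 ms timeout on fd 7 that follows the MPBSLoadQueueInfo log line in the trace quoted in this thread. The following is only a minimal sketch: the function name find_poll_timeouts and the sample text are made up for illustration, and the pattern assumes strace prints poll() in the same form as in that trace.

```python
import re

# Matches strace lines of the form seen in the quoted trace, e.g.
#   poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 0 (Timeout)
# Assumption: a single pollfd entry per call, as in that trace.
POLL_TIMEOUT = re.compile(
    r"poll\(\[\{fd=(\d+),[^\]]*\],\s*\d+,\s*(\d+)\)\s*=\s*0 \(Timeout\)"
)

def find_poll_timeouts(strace_text):
    """Return a list of (fd, timeout_ms) for each poll() call that timed out."""
    return [(int(m.group(1)), int(m.group(2)))
            for m in POLL_TIMEOUT.finditer(strace_text)]

# Hypothetical sample built from three lines of the trace above.
sample = (
    'poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 1 ([{fd=7, revents=POLLIN}])\n'
    'write(7, "+2+12+20+4maui+0+0+0", 20) = 20\n'
    'poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 0 (Timeout)\n'
)

if __name__ == "__main__":
    for fd, timeout_ms in find_poll_timeouts(sample):
        print(f"poll on fd {fd} timed out after {timeout_ms} ms")
```

Knowing which descriptor timed out narrows down which request went unanswered; what that descriptor is connected to still has to be checked separately, for example with lsof on the running maui process.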
