On our cluster we have been having issues with Maui hanging. All of the
commands hang and then time out:
showq
ERROR: lost connection to server
ERROR: cannot request service (status)
I do not see anything enlightening in the logs, and Torque commands
still work fine. I am not entirely certain, but the hang seems to occur
while Maui is actually scheduling something.
Do you have any ideas as to what could cause this?
I have copied the tail end of an strace when this occurs:
write(3, "02/17 09:23:42 ServerProcessRequ"..., 39) = 39
stat("/opt/maui/log/maui.log", {st_mode=S_IFREG|0640, st_size=2023991, ...}) = 0
write(3, "02/17 09:23:42 INFO: not rol"..., 63) = 63
write(3, "02/17 09:23:42 MResAdjust(NULL,0"..., 36) = 36
write(3, "02/17 09:23:42 MStatInitializeAc"..., 47) = 47
write(3, "02/17 09:23:42 MStatClearUsage(["..., 46) = 46
write(3, "02/17 09:23:42 ServerUpdate()\n", 30) = 30
write(3, "02/17 09:23:42 MSysUpdateTime()\n", 32) = 32
write(3, "02/17 09:23:42 INFO: startin"..., 47) = 47
write(3, "02/17 09:23:42 MRMGetInfo()\n", 28) = 28
write(3, "02/17 09:23:42 MClusterClearUsag"..., 36) = 36
write(3, "02/17 09:23:42 MRMClusterQuery()"..., 33) = 33
write(3, "02/17 09:23:42 MPBSClusterQuery("..., 48) = 48
write(7, "+2+12+58+4maui+0+0+0", 20) = 20
poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 1 ([{fd=7, revents=POLLIN}])
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+63+515+32+13compute-27-"..., 262144) = 128000
poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 1 ([{fd=7, revents=POLLIN}])
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, "5 #1 SMP Mon Sep 20 07:12:06 EDT"..., 262077) = 128000
poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 1 ([{fd=7, revents=POLLIN}])
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, "8636kb,ncpus=8,loadave=0.00,netl"..., 261930) = 79914
write(3, "02/17 09:23:42 __MPBSGetNodeStat"..., 52) = 52
write(3, "02/17 09:23:42 INFO: PBS nod"..., 79) = 79
write(3, "02/17 09:23:42 MPBSNodeUpdate(co"..., 72) = 72
write(3, "02/17 09:23:42 MPBSLoadQueueInfo"..., 56) = 56
write(7, "+2+12+20+4maui+0+0+0", 20) = 20
poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 0 (Timeout)
write(3, "02/17 09:23:51 ERROR: cannot "..., 73) = 73
write(7, "+2+12+59+4maui", 14) = 14
rt_sigaction(SIGALRM, {0x1, [], SA_RESTORER, 0x3a910302d0}, {SIG_DFL, [], SA_RESTORER, 0x3a910302d0}, 8) = 0
alarm(9) = 0
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+6+2+1+7default+92+212+1"..., 65536) = 696
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, 0x2aac8a69b5e0, 65536) = ? ERESTARTSYS (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
read(7,
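
In case it helps anyone sift through a longer trace, here is a minimal
Python sketch (not part of Maui or Torque) that scans strace output for
poll() calls that returned 0 (Timeout), like the one on fd 7 above. The
function name find_poll_timeouts and the regular expression are my own
assumptions, and it expects each poll() call to sit on a single line.

```python
import re

# Matches a poll() call whose first descriptor is captured and whose
# result is "0 (Timeout)". Assumes the strace line has not been wrapped.
POLL_TIMEOUT = re.compile(
    r"poll\(\[\{fd=(\d+),.*?\}\], \d+, (\d+)\)\s*=\s*0 \(Timeout\)"
)

def find_poll_timeouts(trace_text):
    """Return (line_number, fd, timeout_ms) for each poll() that timed out."""
    hits = []
    for lineno, line in enumerate(trace_text.splitlines(), start=1):
        match = POLL_TIMEOUT.search(line)
        if match:
            hits.append((lineno, int(match.group(1)), int(match.group(2))))
    return hits

# Three lines taken from the trace above.
sample = """write(7, "+2+12+20+4maui+0+0+0", 20) = 20
poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 0 (Timeout)
write(3, "02/17 09:23:51 ERROR: cannot "..., 73) = 73"""

print(find_poll_timeouts(sample))
```

Each hit gives the line in the trace, the descriptor that was polled, and
the timeout in milliseconds, which makes it easier to see which request
to the server went unanswered.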