I did try that, but I am not seeing any obvious segfaults.

When it hangs, the last entries in the log are:

02/17 16:58:40 INFO:     received service request from host
'headnode.caltech.edu'
02/17 16:58:40 MSURecvPacket(9,BufP,4,NULL,100000,SC)
02/17 16:58:41 ServerProcessRequests()
02/17 16:58:41 INFO:     not rolling logs (6196137 < 10000000)
02/17 16:58:41 MResAdjust(NULL,0,0)
02/17 16:58:41 MStatInitializeActiveSysUsage()
02/17 16:58:41 MStatClearUsage([NONE],Active)
02/17 16:58:41 ServerUpdate()
02/17 16:58:41 MSysUpdateTime()
02/17 16:58:41 INFO:     starting iteration 24
02/17 16:58:41 MRMGetInfo()
02/17 16:58:41 MClusterClearUsage()
02/17 16:58:41 MRMClusterQuery()
02/17 16:58:41 MPBSClusterQuery(base,RCount,SC)
02/17 16:58:41 ERROR:    cannot get node info: Execution server rejected
request MSG=connection to mom timed out

After restarting again, the logs show the following when it hangs:

02/17 17:15:02 INFO:     packet sent (4783 bytes of 4783)
02/17 17:15:02 MSUDisconnect(S)
02/17 17:15:04 ServerProcessRequests()
02/17 17:15:04 INFO:     not rolling logs (8856651 < 10000000)
02/17 17:15:04 MResAdjust(NULL,0,0)
02/17 17:15:04 MStatInitializeActiveSysUsage()
02/17 17:15:04 MStatClearUsage([NONE],Active)
02/17 17:15:04 ServerUpdate()
02/17 17:15:04 MSysUpdateTime()
02/17 17:15:04 INFO:     starting iteration 3
02/17 17:15:04 MRMGetInfo()
02/17 17:15:04 MClusterClearUsage()
02/17 17:15:04 MRMClusterQuery()
02/17 17:15:04 MPBSClusterQuery(base,RCount,SC)
02/17 17:15:04 ERROR:    cannot get node info: NULL
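
To compare hangs across restarts, the tail of the log can be summarized with a short script. This is only a rough sketch: it assumes every entry starts with a "MM/DD HH:MM:SS" timestamp as in the excerpts above, that lines without a timestamp are wrapped continuations of the previous entry, and that the year is 2011. The function name summarize_tail is made up for illustration and is not part of maui.

```python
import re
from datetime import datetime

# Assumed entry shape, taken from the excerpts above: "MM/DD HH:MM:SS <text>".
LINE_RE = re.compile(r"^(\d{1,2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def summarize_tail(lines, year=2011):
    """Return (last_iteration, errors, last_timestamp) for a log tail.

    errors is a list of (timestamp, iteration, text) tuples, one per
    ERROR entry. Lines without a leading timestamp are treated as
    wrapped continuations and appended to the previous entry.
    """
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.strptime(
                f"{year}/{match.group(1)}", "%Y/%m/%d %H:%M:%S"
            )
            entries.append([stamp, match.group(2)])
        elif entries and line.strip():
            entries[-1][1] += " " + line.strip()

    last_iteration = None
    errors = []
    for stamp, text in entries:
        found = re.search(r"starting iteration (\d+)", text)
        if found:
            last_iteration = int(found.group(1))
        if text.startswith("ERROR:"):
            errors.append((stamp, last_iteration, text))

    last_stamp = entries[-1][0] if entries else None
    return last_iteration, errors, last_stamp
```

Feeding it the last few hundred lines of the log (the strace quoted below shows /opt/maui/log/maui.log as the path here) gives the iteration in which the last error was logged and the time of the final entry, which makes it easier to see whether each hang follows the same MPBSClusterQuery error.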



On 02/17/2011 11:12 AM, Andrus, Brian Contractor wrote:
> This says maui has stopped.
>  
> Here is something I am running into lately and maybe you are affected
> by the same:
>  
> Maui starts segfaulting when a job with ppn >1 is submitted.
>  
> You can see that maui is segfaulting by starting it in the foreground:
>  
> /<path to maui>/maui -d
>  
> then submit a job with ppn >1 and see if maui dies.
>  
> Brian Andrus
>
> ------------------------------------------------------------------------
> *From:* [email protected] on behalf of Naveed Near-Ansari
> *Sent:* Thu 2/17/2011 9:40 AM
> *To:* [email protected]
> *Subject:* [Mauiusers] maui commands hang, perhaps when scheduling
>
>
> On our cluster we have been having issues with maui hanging. All of the
> commands hang and then time out:
>
>  showq
> ERROR:    lost connection to server
> ERROR:    cannot request service (status)
>
> I do not see anything enlightening in the logs, and torque commands
> still work fine. I am not entirely certain, but the hang seems to occur
> when maui is actually scheduling something.
>
> Do you have any ideas as to what could cause this?
>
>
> I have copied the tail end of an strace when this occurs:
>
> write(3, "02/17 09:23:42 ServerProcessRequ"..., 39) = 39
> stat("/opt/maui/log/maui.log", {st_mode=S_IFREG|0640, st_size=2023991,
> ...}) = 0
> write(3, "02/17 09:23:42 INFO:     not rol"..., 63) = 63
> write(3, "02/17 09:23:42 MResAdjust(NULL,0"..., 36) = 36
> write(3, "02/17 09:23:42 MStatInitializeAc"..., 47) = 47
> write(3, "02/17 09:23:42 MStatClearUsage(["..., 46) = 46
> write(3, "02/17 09:23:42 ServerUpdate()\n", 30) = 30
> write(3, "02/17 09:23:42 MSysUpdateTime()\n", 32) = 32
> write(3, "02/17 09:23:42 INFO:     startin"..., 47) = 47
> write(3, "02/17 09:23:42 MRMGetInfo()\n", 28) = 28
> write(3, "02/17 09:23:42 MClusterClearUsag"..., 36) = 36
> write(3, "02/17 09:23:42 MRMClusterQuery()"..., 33) = 33
> write(3, "02/17 09:23:42 MPBSClusterQuery("..., 48) = 48
> write(7, "+2+12+58+4maui+0+0+0", 20)    = 20
> poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 1 ([{fd=7,
> revents=POLLIN}])
> fcntl(7, F_GETFL)                       = 0x2 (flags O_RDWR)
> read(7, "+2+1+0+0+63+515+32+13compute-27-"..., 262144) = 128000
> poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 1 ([{fd=7,
> revents=POLLIN}])
> fcntl(7, F_GETFL)                       = 0x2 (flags O_RDWR)
> read(7, "5 #1 SMP Mon Sep 20 07:12:06 EDT"..., 262077) = 128000
> poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 1 ([{fd=7,
> revents=POLLIN}])
> fcntl(7, F_GETFL)                       = 0x2 (flags O_RDWR)
> read(7, "8636kb,ncpus=8,loadave=0.00,netl"..., 261930) = 79914
> write(3, "02/17 09:23:42 __MPBSGetNodeStat"..., 52) = 52
> write(3, "02/17 09:23:42 INFO:     PBS nod"..., 79) = 79
> write(3, "02/17 09:23:42 MPBSNodeUpdate(co"..., 72) = 72
> write(3, "02/17 09:23:42 MPBSLoadQueueInfo"..., 56) = 56
> write(7, "+2+12+20+4maui+0+0+0", 20)    = 20
> poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 0 (Timeout)
> write(3, "02/17 09:23:51 ERROR:    cannot "..., 73) = 73
> write(7, "+2+12+59+4maui", 14)          = 14
> rt_sigaction(SIGALRM, {0x1, [], SA_RESTORER, 0x3a910302d0}, {SIG_DFL,
> [], SA_RESTORER, 0x3a910302d0}, 8) = 0
> alarm(9)                                = 0
> fcntl(7, F_GETFL)                       = 0x2 (flags O_RDWR)
> read(7, "+2+1+0+0+6+2+1+7default+92+212+1"..., 65536) = 696
> fcntl(7, F_GETFL)                       = 0x2 (flags O_RDWR)
> read(7, 0x2aac8a69b5e0, 65536)          = ? ERESTARTSYS (To be restarted)
> --- SIGALRM (Alarm clock) @ 0 (0) ---
> read(7,
>
> _______________________________________________
> mauiusers mailing list
> [email protected]
> http://www.supercluster.org/mailman/listinfo/mauiusers
>