Looks like pbs_mom may not be running on the nodes.

 

Take a look at the output of pbsnodes.

See if there is corruption in the job list for any of the nodes. If so,
restart pbs_server and then check again.

I have seen cases where the job list contains the output of an 'env'
command rather than real jobs. This would cause maui to die as well.
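
If the node list is long, a small script can do the looking for you. This
is only a rough sketch: it assumes pbsnodes prints each node name at the
start of a line, followed by indented "key = value" lines, and that a
healthy "jobs" line is a comma-separated list of slot/jobid entries such as
0/1234.headnode. The function name and the pattern are made up for
illustration and may need adjusting for your TORQUE version.

```python
import re

# Assumed shape of a healthy entry: "<slot>/<jobid>.<server>", e.g.
# "0/1234.headnode". Slot ranges such as "0-3/1234.headnode" are allowed too.
JOB_ENTRY = re.compile(r"^\d+(-\d+)?/\d+(\[\d*\])?\.[\w.-]+$")


def suspicious_job_entries(pbsnodes_text):
    """Return {node: [entries]} for 'jobs' entries that do not look like slot/jobid."""
    bad = {}
    node = None
    for line in pbsnodes_text.splitlines():
        if line and not line[0].isspace():
            # An unindented line is taken to be a node name.
            node = line.strip()
        elif line.strip().startswith("jobs ="):
            entries = line.split("=", 1)[1].split(",")
            odd = [e.strip() for e in entries
                   if e.strip() and not JOB_ENTRY.match(e.strip())]
            if odd:
                bad.setdefault(node, []).extend(odd)
    return bad
```

Feed it the saved text of the pbsnodes output. Any node that comes back in
the result has something other than job IDs in its job list and is worth a
closer look.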

 

Brian Andrus

ITACS/Research Computing

Naval Postgraduate School

Monterey, California

voice: 831-656-6238

 

From: Naveed Near-Ansari [mailto:[email protected]] 
Sent: Thursday, February 17, 2011 5:26 PM
To: Andrus, Brian Contractor
Cc: [email protected]
Subject: Re: [Mauiusers] maui commands hang, perhaps when scheduling

 

I did try that, but I am not seeing any obvious segfaults.

When it hangs, the last entries in the logs are:

2/17 16:58:40 INFO:     received service request from host
'headnode.caltech.edu'
02/17 16:58:40 MSURecvPacket(9,BufP,4,NULL,100000,SC)
02/17 16:58:41 ServerProcessRequests()
02/17 16:58:41 INFO:     not rolling logs (6196137 < 10000000)
02/17 16:58:41 MResAdjust(NULL,0,0)
02/17 16:58:41 MStatInitializeActiveSysUsage()
02/17 16:58:41 MStatClearUsage([NONE],Active)
02/17 16:58:41 ServerUpdate()
02/17 16:58:41 MSysUpdateTime()
02/17 16:58:41 INFO:     starting iteration 24
02/17 16:58:41 MRMGetInfo()
02/17 16:58:41 MClusterClearUsage()
02/17 16:58:41 MRMClusterQuery()
02/17 16:58:41 MPBSClusterQuery(base,RCount,SC)
02/17 16:58:41 ERROR:    cannot get node info: Execution server rejected
request MSG=connection to mom timed out

After restarting again, the logs show the following when it hangs:

02/17 17:15:02 INFO:     packet sent (4783 bytes of 4783)
02/17 17:15:02 MSUDisconnect(S)
02/17 17:15:04 ServerProcessRequests()
02/17 17:15:04 INFO:     not rolling logs (8856651 < 10000000)
02/17 17:15:04 MResAdjust(NULL,0,0)
02/17 17:15:04 MStatInitializeActiveSysUsage()
02/17 17:15:04 MStatClearUsage([NONE],Active)
02/17 17:15:04 ServerUpdate()
02/17 17:15:04 MSysUpdateTime()
02/17 17:15:04 INFO:     starting iteration 3
02/17 17:15:04 MRMGetInfo()
02/17 17:15:04 MClusterClearUsage()
02/17 17:15:04 MRMClusterQuery()
02/17 17:15:04 MPBSClusterQuery(base,RCount,SC)
02/17 17:15:04 ERROR:    cannot get node info: NULL
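
To compare the two hangs it can help to pull just the ERROR entries out of
the log. A minimal sketch, assuming each log line starts with a
"MM/DD HH:MM:SS" timestamp followed by the message, as in the excerpts
above; the function name is only illustrative.

```python
import re

# Timestamp as it appears in the excerpts, e.g. "02/17 17:15:04".
LOG_LINE = re.compile(r"^(\d{1,2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def error_entries(log_text):
    """Return (timestamp, message) pairs for lines whose message starts with ERROR:."""
    found = []
    for line in log_text.splitlines():
        m = LOG_LINE.match(line)
        if m and m.group(2).startswith("ERROR:"):
            message = m.group(2)[len("ERROR:"):].strip()
            found.append((m.group(1), message))
    return found
```

Run against the log file contents, this returns one pair per ERROR line,
which makes it easier to see whether the message changes between restarts.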



On 02/17/2011 11:12 AM, Andrus, Brian Contractor wrote: 

This says maui has stopped.

 

Here is something I am running into lately and maybe you are affected by
the same:

 

Maui starts segfaulting when a job with ppn >1 is submitted.

 

You can see that maui is segfaulting by starting it in the foreground:

 

/<path to maui>/maui -d

 

then submit a job with ppn >1 and see if maui dies.

 

Brian Andrus

 

________________________________

From: [email protected] on behalf of Naveed Near-Ansari
Sent: Thu 2/17/2011 9:40 AM
To: [email protected]
Subject: [Mauiusers] maui commands hang, perhaps when scheduling

 

On our cluster we have been having issues with maui hanging. All of the
commands hang and then time out:

 showq
ERROR:    lost connection to server
ERROR:    cannot request service (status)

I do not see anything enlightening in the logs, and torque commands
still work fine. I am not entirely certain, but the hang seems to occur
when maui is actually scheduling something.

Do you have any ideas as to what could cause this?


I have copied the tail end of an strace when this occurs:

write(3, "02/17 09:23:42 ServerProcessRequ"..., 39) = 39
stat("/opt/maui/log/maui.log", {st_mode=S_IFREG|0640, st_size=2023991,
...}) = 0
write(3, "02/17 09:23:42 INFO:     not rol"..., 63) = 63
write(3, "02/17 09:23:42 MResAdjust(NULL,0"..., 36) = 36
write(3, "02/17 09:23:42 MStatInitializeAc"..., 47) = 47
write(3, "02/17 09:23:42 MStatClearUsage(["..., 46) = 46
write(3, "02/17 09:23:42 ServerUpdate()\n", 30) = 30
write(3, "02/17 09:23:42 MSysUpdateTime()\n", 32) = 32
write(3, "02/17 09:23:42 INFO:     startin"..., 47) = 47
write(3, "02/17 09:23:42 MRMGetInfo()\n", 28) = 28
write(3, "02/17 09:23:42 MClusterClearUsag"..., 36) = 36
write(3, "02/17 09:23:42 MRMClusterQuery()"..., 33) = 33
write(3, "02/17 09:23:42 MPBSClusterQuery("..., 48) = 48
write(7, "+2+12+58+4maui+0+0+0", 20)    = 20
poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 1 ([{fd=7,
revents=POLLIN}])
fcntl(7, F_GETFL)                       = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+63+515+32+13compute-27-"..., 262144) = 128000
poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 1 ([{fd=7,
revents=POLLIN}])
fcntl(7, F_GETFL)                       = 0x2 (flags O_RDWR)
read(7, "5 #1 SMP Mon Sep 20 07:12:06 EDT"..., 262077) = 128000
poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 1 ([{fd=7,
revents=POLLIN}])
fcntl(7, F_GETFL)                       = 0x2 (flags O_RDWR)
read(7, "8636kb,ncpus=8,loadave=0.00,netl"..., 261930) = 79914
write(3, "02/17 09:23:42 __MPBSGetNodeStat"..., 52) = 52
write(3, "02/17 09:23:42 INFO:     PBS nod"..., 79) = 79
write(3, "02/17 09:23:42 MPBSNodeUpdate(co"..., 72) = 72
write(3, "02/17 09:23:42 MPBSLoadQueueInfo"..., 56) = 56
write(7, "+2+12+20+4maui+0+0+0", 20)    = 20
poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 0 (Timeout)
write(3, "02/17 09:23:51 ERROR:    cannot "..., 73) = 73
write(7, "+2+12+59+4maui", 14)          = 14
rt_sigaction(SIGALRM, {0x1, [], SA_RESTORER, 0x3a910302d0}, {SIG_DFL,
[], SA_RESTORER, 0x3a910302d0}, 8) = 0
alarm(9)                                = 0
fcntl(7, F_GETFL)                       = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+6+2+1+7default+92+212+1"..., 65536) = 696
fcntl(7, F_GETFL)                       = 0x2 (flags O_RDWR)
read(7, 0x2aac8a69b5e0, 65536)          = ? ERESTARTSYS (To be
restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
read(7,
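
For a longer trace, the poll calls that timed out can be picked out
mechanically. This is a rough sketch that assumes the strace text has its
lines unwrapped (the mail client wrapped some of them above) and looks for
poll calls that returned "0 (Timeout)"; the function name is made up for
illustration.

```python
import re

# Matches e.g.: poll([{fd=7, events=POLLIN|POLLHUP}], 1, 9000) = 0 (Timeout)
POLL_TIMEOUT = re.compile(
    r"poll\(\[\{fd=(\d+),[^\]]*\], \d+, (\d+)\)\s*= 0 \(Timeout\)"
)


def poll_timeouts(strace_text):
    """Return (fd, timeout_ms) for every poll call that returned 0 (Timeout)."""
    return [(int(m.group(1)), int(m.group(2)))
            for m in POLL_TIMEOUT.finditer(strace_text)]
```

In the trace above this would report descriptor 7 with a 9000 ms timeout,
which is the request sent just before the "cannot ..." error is logged.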

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers

