Pankaj, this is hardly expert advice, but I've seen similar behavior in the 
past, and this is generally the diagnostic pattern I follow:

1.  If your current configuration is the same as it has always been and is only 
recently having problems, my first thought (right or wrong) would be that it 
may be experiencing higher demands than in the past; for example:
 A.  Is it possible that Maui is servicing more user requests than in the past? 
 Lots of users executing status utilities can slow Maui down to the point of 
unresponsiveness.  This is especially likely if the client utilities are 
installed on the compute nodes and/or if your site has local wrappers that run 
client commands automatically.
 B.  Is Maui scheduling more jobs than in the past?  Many of Maui's throttling 
and fairness policies can be safely omitted until your typical workload grows to 
the point where they are needed just to keep each scheduling iteration and 
status query down to a reasonable amount of computation.  Maui seems to favor 
scheduling over client queries (and rightly so), so the fact that scheduling is 
still OK does not rule this out.
 C.  Have Maui's state and/or statistics ever been reset?  I've seen bad 
'maui.ck' files resulting from previous crashes and/or excessive/defective job 
submissions; bad statistics may affect client queries more than scheduling 
operations as well.
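
One rough way to check 1.A on a login or compute node is to count how many Maui 
client commands are running at a given moment.  The sketch below is only an 
illustration: the function name count_clients and the CLIENTS list are my own, 
and the list should be extended with the names of any local wrappers.  It reads 
process names on stdin so it can be fed from ps or from a saved listing.

```shell
# Names of Maui client commands to look for (assumption: add your site's
# wrapper scripts to this list).
CLIENTS='showq|checkjob|showstart|showbf|showres|showstats|showconfig|diagnose'

# count_clients: reads one process name per line on stdin and prints how
# many of them exactly match a known client command.
count_clients() {
    # grep -c exits non-zero when the count is 0, so mask the exit status.
    grep -c -x -E "$CLIENTS" || true
}
```

For example, 'ps -e -o comm= | count_clients' run every few seconds on a login 
node gives a crude picture of query load from that host; it says nothing about 
requests arriving from other hosts.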

2.  I don't think debug mode will help if Maui is running slowly; I think this 
will just slow it down further.  Increasing log verbosity slightly may help to 
identify bad jobs or internal errors, but I've had better luck just observing 
its external characteristics like memory or CPU utilization.  Try suspending 
the scheduler ('schedctl -s') and see if it stabilizes (my favorite "ping" to 
see how quickly Maui can process queries is 'showconfig|head', but any 
non-demanding status query will do); you can then try more complex queries 
and/or fiddle with various combinations of schedctl's "-r", "-s", and "-S" 
options to get a rough idea of what sort of operations cause problems.
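
The "ping" idea above can be made a little more systematic by timing the same 
light query several times.  This is a minimal sketch, not anything shipped with 
Maui: the helper names time_query and sample_queries are hypothetical, and 
output is discarded instead of being piped through head.

```shell
# time_query CMD [ARGS...]
# Runs one client command, discards its output, and reports whether it
# succeeded and how many whole seconds it took.
time_query() {
    start=$(date +%s)
    if "$@" >/dev/null 2>&1; then
        result=ok
    else
        result=failed
    fi
    end=$(date +%s)
    echo "$result $((end - start))s $*"
}

# sample_queries COUNT DELAY CMD [ARGS...]
# Repeats the timing COUNT times, sleeping DELAY seconds between runs.
sample_queries() {
    count=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        time_query "$@"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$delay"
        fi
    done
}
```

For instance, 'sample_queries 5 10 showconfig' run once with the scheduler 
active and once after 'schedctl -s' should show whether response times change 
when scheduling is out of the picture.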

3.  (CAUTION!)  If you're desperate, it may help to back up your state 
(maui.ck) and stats/ and either run 'resetstats' or delete these files and 
restart Maui to see if it can recover better without them.  I don't really 
recommend this, as at the very least it can screw up your fairshare stats and 
user histories and at worst it can exacerbate the problem and/or create new 
ones (especially if the source of the problem is external), but if historical 
statistics are not important to you this may be another way to reduce the 
number of unknowns.  If this does cause problems I won't be able to help you 
fix them (and since Maui was deprecated in favor of Moab a long time ago, I 
wouldn't expect anyone else to volunteer much assistance either), so again I 
don't recommend it if you aren't already 100% confident in your ability to 
diagnose/fix Maui problems; I provide this mostly just for context.
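
If you do go down this road, the backup step might look something like the 
sketch below.  The function name backup_maui_state is hypothetical, and the 
location of Maui's home directory (the one holding maui.ck and stats/) varies by 
install, so it is passed in as an argument.  It is probably safest to run this 
with Maui stopped so the checkpoint file is not rewritten mid-copy.

```shell
# backup_maui_state MAUI_HOME BACKUP_DIR
# Archives maui.ck and stats/ from MAUI_HOME into a timestamped tarball
# under BACKUP_DIR and prints the tarball's path on success.
backup_maui_state() {
    home=$1
    dest=$2
    [ -f "$home/maui.ck" ] || { echo "no maui.ck in $home" >&2; return 1; }
    [ -d "$home/stats" ] || { echo "no stats/ in $home" >&2; return 1; }
    mkdir -p "$dest" || return 1
    archive="$dest/maui-state-$(date +%Y%m%d-%H%M%S).tar.gz"
    tar -czf "$archive" -C "$home" maui.ck stats || return 1
    # Confirm the archive can be listed before anything gets deleted.
    tar -tzf "$archive" >/dev/null || return 1
    echo "$archive"
}
```

Only after the function prints an archive path (and you have checked its 
contents with 'tar -tzf') would it make sense to touch the original files.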

Sorry I can't offer more, but hopefully this will help you think through the 
symptoms you're observing; long story short, you may just need to upgrade the 
server running your scheduler if its old configuration can't keep up with your 
current workload and service status queries at the same time.  Feel free to 
post back with more questions if you would like clarification on any of the 
above (especially #1 or #2 -- again, avoid #3 if you are on a production 
cluster and have any doubts about how to proceed).

Good luck...

Phil Regier, I.S. Analyst
Univ. of Kansas Advanced Computing Facility


----- Original Message -----
From: "Pankaj Dorlikar" <[email protected]>
To: "mauiusers" <[email protected]>
Sent: Monday, February 24, 2014 9:47:20 AM
Subject: [Mauiusers] scheduler issue

Hi,

 We have Maui 3.2.6p21 and Torque 2.5.8. When we try to execute
Maui-related commands, either on the server itself or from clients, they take a
very long time or time out with:

INFO:     client has disconnected, errno: 104 (Connection reset by peer)
ERROR:    lost connection to server
ERROR:    cannot request service (status)

This setup has been in place for a long time and used to work fine. Currently,
scheduling is working fine; only the client commands are causing problems. The
setup is operational and jobs are running.

The Maui and PBS logs do not show anything, and the network and related
services are fine. How can we debug where it is going wrong? Would starting
Maui in debug mode help?

-- 
Pankaj V. Dorlikar

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers