Pankaj, this is hardly expert advice, but I've seen similar behavior in the
past, and this is generally the diagnostic pattern I follow:
1. If your current configuration is the same as it has always been and is only
recently having problems, my first thought (right or wrong) would be that it
may be experiencing higher demands than in the past; for example:
A. Is it possible that Maui is servicing more user requests than in the past?
Lots of users executing status utilities can slow Maui down to the point of
unresponsiveness. This is especially likely if the client utilities are
installed on the compute nodes and/or if your site has local wrappers that run
client commands automatically.
B. Is Maui scheduling more jobs than in the past? Many of Maui's throttling
and fairness policies can be safely omitted until your typical workload reaches
the point where throttling policies must be used just to keep each scheduling
iteration and status query down to a reasonable amount of computation; it seems
that Maui favors scheduling over client queries (and rightly so), so the fact
that scheduling is OK may not preclude this.
C. Have Maui's state and/or statistics ever been reset? I've seen bad
'maui.ck' files resulting from previous crashes and/or excessive/defective job
submissions; bad statistics may affect client queries more than scheduling
operations as well.
2. I don't think debug mode will help if Maui is running slowly; I think this
will just slow it down further. Increasing log verbosity slightly may help to
identify bad jobs or internal errors, but I've had better luck just observing
its external characteristics like memory or CPU utilization. Try suspending
the scheduler ('schedctl -s') and see if it stabilizes (my favorite "ping" to
see how quickly Maui can process queries is 'showconfig|head', but any
non-demanding status query will do); you can then try more complex queries
and/or fiddle with various combinations of schedctl's "-r", "-s", and "-S"
options to get a rough idea of what sort of operations cause problems.
3. (CAUTION!) If you're desperate, it may help to back up your state
(maui.ck) and stats/ and either run 'resetstats' or delete these files and
restart Maui to see if it can recover better without them. I don't really
recommend this, as at the very least it can screw up your fairshare stats and
user histories and at worst it can exacerbate the problem and/or create new
ones (especially if the source of the problem is external), but if historical
statistics are not important to you this may be another way to reduce the
number of unknowns. If this does cause problems I won't be able to help you
fix them (and since Maui was deprecated in favor of Moab a long time ago, I
wouldn't expect anyone else would volunteer much assistance either) so again I
don't recommend it if you aren't already 100% confident in your ability to
diagnose/fix Maui problems; I provide this mostly just for context.
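For what it's worth, the "ping" from #2 and the backup from #3 can be scripted. The following is only a rough Python sketch, not anything shipped with Maui: time_query and backup_state are names I made up, the status labels are arbitrary, and I'm assuming maui.ck and stats/ sit directly under whatever your site uses as the Maui home directory. It only times a command and copies files; it never deletes or resets anything.

```python
import shutil
import subprocess
import time
from pathlib import Path


def time_query(cmd, timeout=30.0):
    """Run one client command and return (elapsed_seconds, status).

    status is "ok", "failed" (non-zero exit), "timeout", or "missing"
    (command not found). These labels are illustrative, not Maui's.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,  # only the timing matters here
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        status = "ok" if result.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        status = "timeout"
    except FileNotFoundError:
        status = "missing"
    return time.monotonic() - start, status


def backup_state(maui_home, dest):
    """Copy maui.ck and stats/ from maui_home into dest.

    Assumes both live directly under maui_home (hypothetical layout).
    Returns the list of paths that were actually copied.
    """
    maui_home, dest = Path(maui_home), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    checkpoint = maui_home / "maui.ck"
    if checkpoint.is_file():
        copied.append(Path(shutil.copy2(checkpoint, dest / checkpoint.name)))
    stats = maui_home / "stats"
    if stats.is_dir():
        # dirs_exist_ok needs Python 3.8 or newer
        copied.append(Path(shutil.copytree(stats, dest / "stats", dirs_exist_ok=True)))
    return copied
```

For example, calling time_query(["showconfig"]) a few times before and after 'schedctl -s' gives you numbers to compare rather than a gut feeling, and backup_state(your Maui home, some backup directory) is worth running before you touch anything in #3.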
Sorry I can't offer more, but hopefully this will help you think through the
symptoms you're observing; long story short, you may just need to upgrade the
server running your scheduler if its old configuration can't keep up with your
current workload and service status queries at the same time. Feel free to
post back with more questions if you would like clarification on any of the
above (especially #1 or #2 -- again, avoid #3 if you are on a production
cluster and have any doubts on how to proceed).
Good luck...
Phil Regier, I.S. Analyst
Univ. of Kansas Advanced Computing Facility
----- Original Message -----
From: "Pankaj Dorlikar" <[email protected]>
To: "mauiusers" <[email protected]>
Sent: Monday, February 24, 2014 9:47:20 AM
Subject: [Mauiusers] scheduler issue
Hi,
We have Maui 3.2.6p21 and TORQUE 2.5.8. When we try to execute Maui-related
commands, either on the server itself or from clients, they take a very long
time or time out, saying
INFO: client has disconnected, errno: 104 (Connection reset by peer)
ERROR: lost connection to server
ERROR: cannot request service (status)
This setup has been in place for a long time and used to work fine. Currently,
scheduling is working fine; only the commands are causing an issue. The setup
is operational and jobs are running.
The Maui and PBS logs do not say anything, and the network and related
services are also fine. How can we debug where it is going wrong? Is starting
Maui in debug mode going to help?
--
Pankaj V. Dorlikar
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers