Re: [Mauiusers] Help with Maui

Steve Young Thu, 26 Jun 2008 05:52:30 -0700


On Jun 25, 2008, at 5:42 PM, Jeremy Mann wrote:

First, Maui has pretty much solved our problems with submitting 100k+
jobs. It was simply overloading the stock PBS scheduler. But now I'm
coming into a (maybe) configuration problem.

I have one user who submits 100k+ jobs. This works fine for the first
couple thousand jobs, which line up in the queue and begin to execute.
However, after a certain time, the queue is only visibly filled with 15
jobs. There are still thousands of jobs hidden. I can see this if I run
qstat multiple times in a row: the queue is repopulated with 15 more
jobs. The jobs only run for about 45 seconds so I'm thinking Maui isn't
picking up on this?

I had a similar situation with a user submitting 7,000 jobs at a time. Like you point out maui can't seem to keep up with scheduling all of them. After posting to the list it was suggested that I create a routing queue in torque:


create queue physics
set queue physics queue_type = Route
set queue physics acl_group_enable = True
set queue physics route_destinations = pompeii
set queue physics enabled = True
set queue physics started = True

Then for the destination queue pompeii I put in the following rule:

set queue pompeii max_queuable = 50

This setup is working well. Torque manages to keep 50 jobs in the pompeii execution queue at all times. Maui is happy since it doesn't have to go through thousands of jobs each iteration, which it couldn't run anyhow due to lack of resources. (I wish we had thousands ;-)).
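
To make the effect of max_queuable concrete, here is a toy Python model of a routing queue feeding a capped execution queue. It is only a sketch under stated assumptions, not how Torque or Maui are implemented: the function name simulate, the slots parameter, and the rule that every started job finishes within one scheduler iteration are all made up for illustration.

```python
from collections import deque


def simulate(total_jobs, max_queuable, slots, iterations):
    """Toy model: a routing queue tops up a capped execution queue.

    Returns (jobs the scheduler saw at each iteration, jobs finished).
    Assumption: every job that starts also finishes within one iteration.
    """
    routing = deque(range(total_jobs))  # jobs waiting in the routing queue
    execution = deque()                 # jobs the scheduler actually sees
    seen_per_iteration = []
    finished = 0

    for _ in range(iterations):
        # Server side: refill the execution queue up to its cap.
        while routing and len(execution) < max_queuable:
            execution.append(routing.popleft())

        # Scheduler side: it only has to consider what is in the
        # execution queue, never the full backlog.
        seen_per_iteration.append(len(execution))

        # Start as many jobs as there are free slots.
        for _ in range(min(slots, len(execution))):
            execution.popleft()
            finished += 1

    return seen_per_iteration, finished


if __name__ == "__main__":
    seen, done = simulate(total_jobs=7000, max_queuable=50,
                          slots=16, iterations=10)
    print("largest queue seen by scheduler:", max(seen))
    print("jobs finished:", done)
```

In this model the scheduler never looks at more than max_queuable jobs per pass, no matter how many thousands are waiting in the routing queue.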

My second question is probably more important. We run two PBS
environments on all of our clusters. One is our 'default' high priority
queue and the second is the 'default' low priority queue. The low
priority queue is for jobs that run at nice 19 or 20. So we load up the
low priority queue with niced jobs and don't care how long they take to
finish. This leaves the high priority queue to process our own grid and
MPI jobs.

This has worked fine for a while, but now I have a user who wants to run
a few hundred thousand jobs in our low priority queue (see paragraph 1
and 2). The stock pbs_sched was simply getting overloaded and would
crash. This is when I set up a test cluster using PBS/Maui and we
haven't had a problem (other than the 15 queue limit I spoke of before).

Yesterday, I set up a second Maui to schedule the low priority queue. I
could submit jobs, check job status, however the jobs would never run.
This is when I started checking the Maui logs and found checksum
errors. This is when I discovered the problem. The $PATH environment
picks up the normally installed Maui and uses its binaries to perform
its functions. Turns out the checksum error is when the normally
installed Maui tries to process and query the second low priority
queue. I can get around this by using the second installed Maui's
binaries to query the low priority queue. If there is a way to disable
this at compilation time, I think my problem will go away.

I look forward to any comments or questions!

As for having two separate instances of torque/maui I'm not sure why you would want to do this? Are they load balanced or something? One instance of torque/maui should be able to handle everything you need plus you won't have this confusion of keeping track of multiple binaries/installations.
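
If both installations do stay around for now, one way to avoid picking up the wrong client from $PATH is to resolve each command against an explicit install prefix. A minimal Python sketch, assuming each installation keeps its clients under <prefix>/bin; the helper name resolve_client and the prefix /opt/maui-low are hypothetical, not anything from your setup.

```python
import os
import shutil


def resolve_client(name, prefix=None):
    """Return the full path of a client command, or None if not found.

    With a prefix, only <prefix>/bin is searched, so a same-named binary
    earlier in $PATH cannot be picked up by accident.
    Without a prefix, the normal $PATH lookup is used.
    """
    if prefix is None:
        return shutil.which(name)
    return shutil.which(name, path=os.path.join(prefix, "bin"))


if __name__ == "__main__":
    # Hypothetical prefix for the second (low priority) installation.
    print("from $PATH:   ", resolve_client("showq"))
    print("from prefix:  ", resolve_client("showq", "/opt/maui-low"))
```

Printing both results side by side shows quickly whether a command would come from the default installation or from the second one.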


-Steve






