I had a similar situation with a user submitting 7,000 jobs at a time.
As you point out, maui can't seem to keep up with scheduling all of
them. After posting to the list it was suggested that I create a routing
queue in torque:
create queue physics
set queue physics queue_type = Route
set queue physics acl_group_enable = True
set queue physics route_destinations = pompeii
set queue physics enabled = True
set queue physics started = True
Then for the destination queue pompeii I put in the following rule:
set queue pompeii max_queuable = 50
This setup is working well. Torque manages to keep 50 jobs in the
pompeii execution queue at all times. Maui is happy since it doesn't
have to go through thousands of jobs each iteration, which it couldn't
run anyhow due to lack of resources. (I wish we had thousands ;-)).
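
To make the refill behaviour above concrete, here is a toy model in Python. It is only a sketch and not Torque code: the `refill` function, the oldest-first routing order and the job ids are assumptions made for illustration. It shows the execution queue being topped up to the max_queuable limit as jobs finish.

```python
from collections import deque

MAX_QUEUABLE = 50  # mirrors "set queue pompeii max_queuable = 50"


def refill(routing, execution, limit=MAX_QUEUABLE):
    """Move jobs from the routing queue to the execution queue.

    Hypothetical helper: jobs are moved oldest first until the
    execution queue holds `limit` jobs or the routing queue is empty.
    Returns the number of jobs moved.
    """
    moved = 0
    while routing and len(execution) < limit:
        execution.append(routing.popleft())
        moved += 1
    return moved


routing = deque(range(7000))  # 7,000 submitted job ids
execution = []                # the only jobs the scheduler gets to see

refill(routing, execution)
visible_after_submit = len(execution)

# Ten jobs finish; the next refill tops the execution queue up again.
del execution[:10]
moved_after_completion = refill(routing, execution)

print(visible_after_submit, moved_after_completion, len(routing))
```

In this model the scheduler never has more than 50 jobs to iterate over, no matter how many are waiting in the routing queue.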
Please note that any newer job that should trigger preemption will NO
LONGER do so with this setup, since maui applies its scheduling
algorithms only to those 50 jobs.
The same goes for higher-priority jobs, or anything else that must be
executed ASAP: they wait in the routing queue, where maui cannot see them.
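
A small Python sketch of that effect, under assumed numbers (200 ordinary jobs, one urgent job submitted last, oldest-first routing); the priority values and job names are made up for illustration and this is not how maui computes priorities:

```python
from collections import deque

LIMIT = 50  # stands in for max_queuable on the execution queue

# Each job is (priority, job_id); a lower number means more urgent
# in this toy model.
routing = deque((5, "job%d" % i) for i in range(200))
routing.append((0, "urgent"))  # submitted last, highest priority

# The routing queue hands over the oldest jobs first, up to the limit.
execution = [routing.popleft() for _ in range(LIMIT)]

# The scheduler can only choose among the jobs it can see.
chosen = min(execution, key=lambda job: job[0])
urgent_visible = any(job_id == "urgent" for _, job_id in execution)

# Number of jobs that must be routed before the urgent job is.
ahead_in_routing = list(routing).index((0, "urgent"))

print(chosen, urgent_visible, ahead_in_routing)
```

Here the urgent job stays invisible to the scheduler until 150 other jobs have been routed ahead of it, so neither its priority nor preemption can take effect before then.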
You essentially turn your setup into a "50 jobs at a time" batching system.
So, depending on your needs, you should increase max_queuable.
Before maui I managed to run a heavily patched pbs_sched (early torque
releases) with, I think, around 20k+ jobs queued.
After that I abandoned that setup because I needed preemption (sorry, no
docs left from that time).
I had maui running with 10k+ jobs (and changed the #define so it would
consider 8K instead of 4K jobs for real scheduling), but it's not nice
and it eats memory like it's sugar (500MB+ RSS).
And I still think scheduling over only 8K jobs is far too few for such a system.
Because I no longer have this setup in operation, I stopped working
privately on maui to remedy those shortcomings.
BR,
Ronny