I had a similar situation with a user submitting 7,000 jobs at a time. As you point out, Maui can't seem to keep up with scheduling all of them. After posting to the list, it was suggested that I create a routing queue in Torque:

create queue physics
set queue physics queue_type = Route
set queue physics acl_group_enable = True
set queue physics route_destinations = pompeii
set queue physics enabled = True
set queue physics started = True

Then for the destination queue pompeii I put in the following rule:

set queue pompeii max_queuable = 50
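For reference, both halves of the setup can be applied in one qmgr session by piping the directives in on stdin. This is only a sketch: it assumes pompeii does not already exist as a plain execution queue (if it does, drop the create and queue_type lines), and that the group ACL for physics still needs to be filled in with acl_groups:

```
qmgr <<'EOF'
create queue pompeii
set queue pompeii queue_type = Execution
set queue pompeii max_queuable = 50
set queue pompeii enabled = True
set queue pompeii started = True
create queue physics
set queue physics queue_type = Route
set queue physics acl_group_enable = True
set queue physics route_destinations = pompeii
set queue physics enabled = True
set queue physics started = True
EOF
```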

This setup is working well. Torque keeps 50 jobs in the pompeii execution queue at all times, and Maui is happy since it no longer has to churn through thousands of jobs on each scheduling iteration, jobs it couldn't run anyway for lack of resources. (I wish we had thousands ;-)).

Please note that ANY newer jobs that might trigger preemption will NO LONGER WORK with this setup, since Maui only applies its scheduling algorithms to those 50 jobs. The same goes for higher-priority jobs, or anything else that must be executed ASAP: they sit in the routing queue until a slot opens.
You essentially turn your setup into a "50 jobs at a time" batching system.
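To see why preemption and priorities stop working, here is a toy Python sketch of a routing queue feeding a capped execution queue. All names here are illustrative, not actual Torque or Maui code; the point is that routing is FIFO, so a later urgent job never reaches the scheduler's view while the execution queue is full:

```python
from collections import deque

# Toy model: a routing queue drains into an execution queue that is
# capped by max_queuable. Purely illustrative, not Torque internals.
MAX_QUEUABLE = 3  # stand-in for the 50 used above

def route(routing_q, exec_q, cap):
    """Move jobs FIFO from the routing queue into the execution queue
    until the cap is reached; priority is ignored at this stage."""
    while routing_q and len(exec_q) < cap:
        exec_q.append(routing_q.popleft())

routing = deque(f"job{i}" for i in range(1, 6))  # job1..job5 submitted
execq = []
route(routing, execq, MAX_QUEUABLE)   # job1..job3 become visible to Maui

# A high-priority job submitted later lands in the routing queue...
routing.append("urgent")
route(routing, execq, MAX_QUEUABLE)   # ...but the execution queue is full,
                                      # so "urgent" stays invisible until
                                      # one of the first three jobs finishes.
print(execq)  # only these jobs are ever considered for scheduling
```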

So, depending on your needs, you should increase max_queuable accordingly.


Before Maui, I managed to run a heavily patched pbs_sched (from the early Torque releases) with, I think, around 20k+ jobs queued.

I later abandoned that setup because I needed preemption (sorry, no docs left from that time). I then had Maui running with 10k+ jobs (after changing the #define so it would consider 8K instead of 4K jobs for real scheduling), but it's not pretty and it eats memory like sugar (500MB+ RSS).
And I still think a cap of 8K schedulable jobs is far too low for such a system.


Since I no longer have this setup in operation, I have stopped working privately on Maui to remedy those shortcomings.

BR,
Ronny

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
