On Mon, Jan 24, 2022 at 01:17:30PM -0600, Tom Harvill wrote:
>
> Hello,
>
> We use a 'fair share' feature of our scheduler (SLURM) and have our decay
> half-life (the time needed for priority penalty to halve) set to 30 days.
> Our maximum job runtime is 7 days. I'm wondering what others use, please
> let me know if you can spare a minute. Thank you!

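For concreteness, here is a rough sketch of what a 30-day half-life implies,
assuming plain exponential decay of recorded usage. This is only an
illustration, not SLURM's actual code, and the function name decayed_weight
is made up for the example.

```python
def decayed_weight(age_days: float, half_life_days: float = 30.0) -> float:
    """Fraction of past usage that still counts after age_days.

    Assumes simple exponential decay: the weight halves every
    half_life_days. Hypothetical helper, not part of any scheduler API.
    """
    return 0.5 ** (age_days / half_life_days)


if __name__ == "__main__":
    # 7 days is the maximum job runtime mentioned in the question.
    for age in (7, 30, 60, 90):
        print(f"usage from {age:2d} days ago -> weight {decayed_weight(age):.3f}")
```

Under that assumption, usage from a week ago still counts at roughly 85%,
and it takes about three months to fall to an eighth of its original weight.
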
We're a Grid Engine shop, not SLURM, but a few years ago we significantly
reduced the weight of the fair-share policy and boosted the relative
weight of the functional policy. The problem we were having was that the
fair-share policy took a long time to adjust to sudden changes in usage,
and working out what someone's priority would be (or should be) based on
prior usage could be pretty challenging. The functional policy adjusts
immediately based on current workload and is a lot easier for our users
to comprehend.

I'm not sure what the equivalent of the functional policy is in SLURM,
but in GE it's ticket-based: accounts, projects, and "departments" (labs,
in our context) are given some number of tickets, which are consumed by
running jobs and returned when the jobs finish. By default, every job
from a single source gets an equal share of tickets, but that share is
adjustable on submission, so a user can assign a relative importance to
their own jobs.

We also use the urgency policy heavily, where the resource requests of a
job influence its final priority. This lets us boost the priority of jobs
requesting hard-to-satisfy resources (lots of memory on one node, GPUs,
etc.) to avoid starving them amongst a swarm of tiny jobs.

Scheduling policy is a really iterative process, and ours took a long
time to tweak to everyone's (mostly) satisfaction.

--
Skylar
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf