You could try requesting an increased job limit for the community user. SDSC sets different queued-job limits for gateway accounts vs. individual users, and I think TACC would probably be receptive to a similar request.
On Tue, Sep 02, 2014 at 09:11:00AM -0500, Borries Demeler wrote:
> Our application involves submission of several hundred quite small
> computational jobs (a couple of minutes on most clusters, ~128 cores, give
> or take), running the same code on multiple datasets.
>
> We are hitting the limit of 50 jobs on TACC resources, with all others
> failing. The problem is made worse because all users submit under a
> community account, which treats every submission as part of the same
> allocation account.
>
> I see a few possibilities:
>
> 1. a separate FIFO queue, making sure none of the resources get overloaded
>    by any community account user
>
> 2. submitting all jobs as a single job somehow, where the parent job is
>    submitted for the aggregate walltime of all jobs. A special workscript
>    would spawn jobs underneath the parent submission. Not sure if this is
>    feasible or reasonable.
>
> 3. spreading the jobs around all possible resources
>
> 4. a combination of 1 and 3.
>
> -Borries
>
> On Tue, Sep 02, 2014 at 07:50:12AM -0400, Suresh Marru wrote:
> > Hi All,
> >
> > Need some guidance on identifying a scheduling strategy and a pluggable
> > third-party implementation for Airavata's scheduling needs. For context,
> > let me describe the use cases for scheduling within Airavata:
> >
> > * When a gateway/user submits a series of jobs, Airavata currently does
> > not throttle them and sends them to compute clusters in a FIFO way.
> > Resources enforce a per-user job limit within a queue to ensure fair use
> > of the clusters (example: Stampede allows 50 jobs per user in the normal
> > queue [1]). Airavata will need to implement queues and throttle jobs,
> > respecting the max-jobs-per-queue limits of the underlying resource queue.
> >
> > * The current version of Airavata also does not perform job scheduling
> > across the available computational resources; it expects gateways/users
> > to pick resources at experiment launch. Airavata will need to implement
> > schedulers that are aware of the existing load on the clusters and spread
> > jobs efficiently. The scheduler should have access to heuristics from
> > previous executions and to the current requirements, which include job
> > size (number of nodes/cores), memory requirements, walltime estimates,
> > and so forth.
> >
> > * As Airavata maps multiple individual user jobs onto one or more
> > community-account submissions, it also becomes critical to implement
> > fair-share scheduling among these users, to ensure fair use of
> > allocations as well as of the allowable queue limits.
> >
> > Other use cases?
> >
> > We would greatly appreciate it if folks on this list could shed light on
> > experiences using schedulers implemented in Hadoop, Mesos, Storm, or
> > other frameworks outside of their intended use. For instance, the Hadoop
> > (YARN) capacity [2] and fair [3][4][5] schedulers seem to meet Airavata's
> > needs. Is it a good idea to attempt to reuse these implementations? Are
> > there any other pluggable third-party alternatives?
> >
> > Thanks in advance for your time and insights,
> >
> > Suresh
> >
> > [1] - https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide#running
> > [2] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
> > [3] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
> > [4] - https://issues.apache.org/jira/browse/HADOOP-3746
> > [5] - https://issues.apache.org/jira/browse/YARN-326
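
For the throttling use case in Suresh's first bullet, something along these
lines might be enough as a first cut. This is only a rough Java sketch, not
an Airavata API -- all class and method names are invented for illustration.
The idea is a per-resource-queue gate that never lets the community account
hold more than the resource's per-user limit (e.g. 50 jobs in Stampede's
normal queue); anything beyond that waits in a local backlog until a
job-finished event frees a slot.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.Semaphore;

    // Hypothetical sketch (invented names, not an Airavata API): cap the number
    // of jobs the community account keeps in one remote resource queue, e.g. 50
    // for Stampede's "normal" queue, and hold the rest in a local backlog.
    public class ResourceQueueThrottle {
        private final Semaphore slots;   // one permit per allowed remote job
        private final BlockingQueue<Runnable> backlog = new LinkedBlockingQueue<>();

        public ResourceQueueThrottle(int maxJobsPerQueue) {
            this.slots = new Semaphore(maxJobsPerQueue);
        }

        // Called when Airavata wants to push a job to this resource queue.
        public void submit(Runnable submitToCluster) throws InterruptedException {
            backlog.put(submitToCluster);
            drain();
        }

        // Called by the monitoring layer when a job leaves the remote queue
        // (completed, failed, or cancelled): frees a slot and releases backlog.
        public void onJobTerminated() {
            slots.release();
            drain();
        }

        private void drain() {
            while (slots.tryAcquire()) {
                Runnable next = backlog.poll();
                if (next == null) {      // nothing waiting: hand the permit back
                    slots.release();
                    return;
                }
                next.run();              // actual submission (GRAM/SSH/etc.) happens here
            }
        }
    }

The in-memory backlog is just to show the shape; in practice it would need to
be persisted (registry/DB) so a restart does not lose queued work.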
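
For the second bullet (spreading jobs across resources), the simplest thing
that could work is to pick, per job, the authorized resource with free
headroom and the shortest expected wait, fed by monitoring data and past-run
heuristics. Again a hypothetical sketch with invented names and a
deliberately naive scoring rule:

    import java.util.Comparator;
    import java.util.List;

    // Hypothetical sketch: among the resources a gateway may use, pick the one
    // with remaining job slots and the lowest expected queue wait. The inputs
    // would come from monitoring plus heuristics on previous executions.
    public class LeastLoadedResourcePicker {

        public static class ResourceState {
            final String host;                 // e.g. "stampede.tacc.utexas.edu"
            final int maxJobsPerUser;          // queue policy, e.g. 50
            final int jobsInQueue;             // jobs the community account already has there
            final double avgQueueWaitSeconds;  // heuristic from previous executions

            public ResourceState(String host, int maxJobsPerUser,
                                 int jobsInQueue, double avgQueueWaitSeconds) {
                this.host = host;
                this.maxJobsPerUser = maxJobsPerUser;
                this.jobsInQueue = jobsInQueue;
                this.avgQueueWaitSeconds = avgQueueWaitSeconds;
            }

            boolean hasHeadroom() { return jobsInQueue < maxJobsPerUser; }
        }

        // Returns the least-loaded resource that still has a free job slot, or null.
        public ResourceState pick(List<ResourceState> candidates) {
            return candidates.stream()
                    .filter(ResourceState::hasHeadroom)
                    .min(Comparator
                            .comparingDouble((ResourceState r) -> r.avgQueueWaitSeconds)
                            .thenComparingInt(r -> r.jobsInQueue))
                    .orElse(null);
        }
    }

A real scorer would also weigh job size, memory, and walltime estimates
against what each queue historically handles well.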

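And on the fair-share bullet: since everyone funnels through one community
account, Airavata itself has to decide whose job goes out next. A toy version
(names invented, no decay or weights) is to always release the next job for
the gateway user with the least accumulated usage -- roughly the core idea
behind the YARN fair scheduler linked above [3]:

    import java.util.ArrayDeque;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.PriorityQueue;
    import java.util.Queue;

    // Hypothetical sketch (not an Airavata API): pick the next job for whichever
    // gateway user behind the shared community account has used the least so far.
    public class FairShareSelector {
        private final Map<String, Double> usedCoreHours = new HashMap<>();
        private final Map<String, Queue<String>> pendingJobs = new HashMap<>();

        public synchronized void enqueue(String user, String jobId) {
            pendingJobs.computeIfAbsent(user, u -> new ArrayDeque<>()).add(jobId);
            usedCoreHours.putIfAbsent(user, 0.0);
        }

        // Returns the next job id to hand to the throttle, or null if nothing is pending.
        public synchronized String nextJob() {
            PriorityQueue<String> byUsage =
                    new PriorityQueue<>(Comparator.comparingDouble(usedCoreHours::get));
            for (Map.Entry<String, Queue<String>> e : pendingJobs.entrySet()) {
                if (!e.getValue().isEmpty()) {
                    byUsage.add(e.getKey());
                }
            }
            String user = byUsage.poll();
            return (user == null) ? null : pendingJobs.get(user).poll();
        }

        // Accounting callback once a job finishes: cores * wall-clock hours consumed.
        public synchronized void recordUsage(String user, double coreHours) {
            usedCoreHours.merge(user, coreHours, Double::sum);
        }
    }

The YARN fair scheduler layers weighted shares, minimum shares, and
preemption on top of this basic ordering, which is where reusing its
implementation (or at least its policy code) could pay off.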