Yes, Kenneth, this is a good idea. I had already contacted TACC (Chris Hempel) and he is looking into it. At this point, we will likely get some relief on Lonestar, which is no longer an XSEDE resource, but as a UT member I continue to get access there through a separate non-XSEDE UT-only allocation. They have less pressure on Lonestar now that XSEDE users are no longer using it.
-Borries

On Tue, Sep 02, 2014 at 09:46:57AM -0700, K Yoshimoto wrote:
> You could try requesting an increased job limit for the community user.
> SDSC sets different queued job limits for gateway vs. individual users.
> I think TACC would probably be receptive to that.
>
> On Tue, Sep 02, 2014 at 09:11:00AM -0500, Borries Demeler wrote:
> > Our application involves submission of several hundred quite small
> > computational jobs (a couple of minutes on most clusters, ~128 cores,
> > give or take), running the same code on multiple datasets.
> >
> > We are hitting the limit of 50 jobs on TACC resources, with all others
> > failing. The problem is made worse because all users submit under a
> > community account, which treats every submission as part of the same
> > allocation account.
> >
> > I see a few possibilities:
> >
> > 1. a separate FIFO queue, making sure none of the resources get
> > overloaded by any community account user
> >
> > 2. submitting all jobs somehow as a single job, so that the job is
> > submitted for the aggregate walltime of all jobs. A special workscript
> > would spawn jobs underneath the parent submission. Not sure if this is
> > feasible or reasonable.
> >
> > 3. spreading the jobs around all possible resources
> >
> > 4. a combination of 1 and 3.
> >
> > -Borries
> >
> > On Tue, Sep 02, 2014 at 07:50:12AM -0400, Suresh Marru wrote:
> > > Hi All,
> > >
> > > I need some guidance on identifying a scheduling strategy and a
> > > pluggable third-party implementation for Airavata's scheduling needs.
> > > For context, let me describe the use cases for scheduling within
> > > Airavata:
> > >
> > > * If a gateway/user submits a series of jobs, Airavata currently does
> > > not throttle them and sends them straight to compute clusters (in a
> > > FIFO way). Resources enforce per-user job limits within a queue to
> > > ensure fair use of the clusters (example: Stampede allows 50 jobs per
> > > user in the normal queue [1]). Airavata will need to implement queues
> > > and throttle jobs, respecting the max-jobs-per-queue limits of an
> > > underlying resource queue.
> > >
> > > * The current version of Airavata also does not perform job
> > > scheduling across available computational resources; it expects
> > > gateways/users to pick resources during experiment launch. Airavata
> > > will need to implement schedulers that are aware of existing loads on
> > > the clusters and spread jobs efficiently. The scheduler should have
> > > access to heuristics on previous executions and on current
> > > requirements, which include job size (number of nodes/cores), memory
> > > requirements, walltime estimates, and so forth.
> > >
> > > * As Airavata maps multiple individual user jobs onto one or more
> > > community account submissions, it also becomes critical to implement
> > > fair-share scheduling among these users, to ensure fair use of
> > > allocations as well as compliance with allowable queue limits.
> > >
> > > Other use cases?
> > >
> > > We would greatly appreciate it if folks on this list could shed light
> > > on experiences using schedulers implemented in Hadoop, Mesos, Storm,
> > > or other frameworks outside of their intended use. For instance, the
> > > Hadoop (YARN) capacity [2] and fair schedulers [3][4][5] seem to meet
> > > Airavata's needs. Is it a good idea to attempt to reuse these
> > > implementations? Are there other pluggable third-party alternatives?
> > >
> > > Thanks in advance for your time and insights,
> > >
> > > Suresh
> > >
> > > [1] - https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide#running
> > > [2] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
> > > [3] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
> > > [4] - https://issues.apache.org/jira/browse/HADOOP-3746
> > > [5] - https://issues.apache.org/jira/browse/YARN-326
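[Editor's note: the throttling and fair-share ideas discussed in the thread above can be sketched in a few lines. This is a minimal illustration only, not Airavata code; the class and method names (`FairShareThrottler`, `enqueue`, `dispatch`, `complete`) and the policy choice (pick the pending user with the fewest running jobs) are hypothetical.]

```python
from collections import deque, defaultdict

class FairShareThrottler:
    """Hypothetical sketch: hold jobs per user and release at most
    `max_jobs` concurrent submissions to a resource, always dispatching
    for the pending user with the fewest running jobs (simple fair share
    among community-account users)."""

    def __init__(self, max_jobs):
        self.max_jobs = max_jobs            # per-queue limit, e.g. 50 on Stampede
        self.pending = defaultdict(deque)   # user -> FIFO of queued job ids
        self.running = defaultdict(int)     # user -> jobs currently submitted

    def enqueue(self, user, job_id):
        """Accept a job without submitting it to the cluster yet."""
        self.pending[user].append(job_id)

    def dispatch(self):
        """Return the next (user, job_id) to submit, or None if the
        resource's job limit is reached or nothing is pending."""
        if sum(self.running.values()) >= self.max_jobs:
            return None
        candidates = [u for u in self.pending if self.pending[u]]
        if not candidates:
            return None
        user = min(candidates, key=lambda u: self.running[u])  # fair share
        job = self.pending[user].popleft()
        self.running[user] += 1
        return user, job

    def complete(self, user):
        """Called when a job finishes, freeing one slot."""
        self.running[user] -= 1
```

With `max_jobs=2`, two users each enqueueing jobs would alternate slots: after one of alice's jobs is dispatched, bob's first job is released before alice's second, and a third `dispatch()` returns `None` until a running job completes. A real implementation would also need per-resource limits and the load/walltime heuristics the thread mentions; this only shows the core bookkeeping.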
