What happens in the situation where every running/queued job has come from a non-Airavata source? Airavata won't receive events for them as they complete/fail.
It's a rare edge case, but it still needs to be considered, I think.

On Aug 3, 2015 2:01 PM, "Pierce, Marlon" <[email protected]> wrote:

> If you are at or near the queue limit, you can wait for a currently
> running job to complete (and Airavata receives the “completed” or “failed”
> event) so that you have an empty slot.
>
> Marlon
>
> From: John Weachock <[email protected]>
> Reply-To: dev <[email protected]>
> Date: Monday, August 3, 2015 at 1:51 PM
> To: dev <[email protected]>
> Subject: Re: Job Submission Limit
>
> I still have some questions about this method.
>
> When we reach the policy limit and move rejected jobs into our queue, how
> will we determine when it's safe to attempt submission again? A regular
> ticking event, such as every 5 minutes? Or is there another way?
>
> What types of rejection messages/codes will we receive? For example, what
> happens if a job is rejected because it requests too many resources,
> rather than exceeding the number of jobs?
>
> On Aug 3, 2015 1:40 PM, "K Yoshimoto" <[email protected]> wrote:
>
>> Yes, that's the idea. In general, something dynamic and adaptable
>> will probably be more robust than a rigid limit.
>>
>> On Mon, Aug 03, 2015 at 01:15:50PM -0400, John Weachock wrote:
>> > Ah! I think I understand what you're saying now. Rather than trying to
>> > ensure we stay within the policy limits, we should just submit a job
>> > and check if it was accepted or not. If it was rejected, we can add it
>> > to a queue to be resubmitted at a later time or to a different
>> > resource. Is this correct?
>> >
>> > On Mon, Aug 3, 2015 at 1:10 PM, K Yoshimoto <[email protected]> wrote:
>> > >
>> > > The point is that the policy limit could change at any time.
>> > > If it does, and there is a mismatch between the limit at the resource
>> > > and the limit in Airavata, bad things will happen.
>> > > Schedulers will vary in the format of their policy limit output, so
>> > > it's more reliable to monitor actual job submissions and handle
>> > > failures. Remember that it's possible for job limits to vary for a
>> > > single resource not only on queue name, but on job characteristics,
>> > > such as allocation account, core count, wall clock limit, etc.
>> > >
>> > > On Mon, Aug 03, 2015 at 12:53:22PM -0400, Raminderjeet Singh wrote:
>> > > > Usually these limits are set as a policy by the resource provider
>> > > > and do not change often. As long as we have a placeholder to
>> > > > configure/change it in Airavata for a user/gateway, we don't need
>> > > > to get it from a resource.
>> > > >
>> > > > On Mon, Aug 3, 2015 at 12:33 PM, John Weachock
>> > > > <[email protected]> wrote:
>> > > >
>> > > > > I think it would be best for us not to maintain our own record of
>> > > > > the job limit - we need to remember that jobs will be submitted
>> > > > > to these resources using the community accounts through other
>> > > > > methods as well. I think I remember someone mentioning that it
>> > > > > would be ideal to poll the resources for their limits - can
>> > > > > anyone confirm that we can do this?
>> > > > >
>> > > > > On Mon, Aug 3, 2015 at 12:24 PM, Douglas Chau
>> > > > > <[email protected]> wrote:
>> > > > >
>> > > > >> Hmm @shameera, that's very true. Perhaps we can store the
>> > > > >> submission requests in the registry. In the event that the
>> > > > >> orchestrator goes down, we can recover them through the registry
>> > > > >> afterwards.
>> > > > >>
>> > > > >> @Yoshimoto, I didn't think about that - will take it into
>> > > > >> consideration. Thanks for the insight!
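The submit-and-handle-rejection approach raises exactly the question John asks above: how do we tell a transient "over the job limit" rejection (safe to requeue and retry) from a permanent "bad request" rejection (too many cores, walltime over the queue maximum)? A minimal Java sketch of such a classifier follows. Everything here is illustrative: the class name is hypothetical, and the regex patterns only gesture at the kinds of strings schedulers emit — as Yoshimoto notes, the actual message formats vary per scheduler and would need to be catalogued.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not actual Airavata code: classify a scheduler's
// rejection output so the orchestrator can decide between requeueing the
// job (limit reached, transient) and failing it (invalid request).
public class SubmissionFailureClassifier {

    public enum Action { REQUEUE_AND_RETRY, FAIL_PERMANENTLY }

    // Fragments that typically indicate a per-user/queue job-count limit.
    // Exact wording varies by scheduler (Slurm, PBS, SGE, ...), so this
    // list is an assumption and would need tuning per resource.
    private static final Pattern LIMIT_PATTERN = Pattern.compile(
            "(?i)(job limit|MaxSubmitJobs|maximum number of jobs|AssocMaxSubmitJobLimit)");

    // Fragments suggesting the request itself is invalid, so retrying
    // the same request will never succeed.
    private static final Pattern INVALID_PATTERN = Pattern.compile(
            "(?i)(exceeds queue|invalid request|too many (cores|nodes)|walltime.*exceed)");

    public static Action classify(String schedulerOutput) {
        if (LIMIT_PATTERN.matcher(schedulerOutput).find()) {
            return Action.REQUEUE_AND_RETRY;   // transient: a slot will free up
        }
        if (INVALID_PATTERN.matcher(schedulerOutput).find()) {
            return Action.FAIL_PERMANENTLY;    // resubmitting unchanged won't help
        }
        // Unknown rejection: be conservative and retry (a bounded number of
        // times in a real implementation).
        return Action.REQUEUE_AND_RETRY;
    }
}
```

A caller would feed this the stderr of `sbatch`/`qsub`, requeue on `REQUEUE_AND_RETRY`, and surface an error to the gateway user on `FAIL_PERMANENTLY`.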
>> > > > >> On Mon, Aug 3, 2015 at 12:11 PM, K Yoshimoto <[email protected]> wrote:
>> > > > >>
>> > > > >>> I think you also want to put in a check for successful
>> > > > >>> submission, then take appropriate action on failed submission.
>> > > > >>> It can be difficult to keep the submission limit up-to-date.
>> > > > >>>
>> > > > >>> On Mon, Aug 03, 2015 at 11:03:46AM -0400, Douglas Chau wrote:
>> > > > >>> > Hey Devs,
>> > > > >>> >
>> > > > >>> > Just wanted to get some input on our plan to implement the
>> > > > >>> > queue throttling feature.
>> > > > >>> >
>> > > > >>> > Batch Queue Throttling:
>> > > > >>> > - In Orchestrator, the current submit() function in
>> > > > >>> > GFACPassiveJobSubmitter publishes jobs to RabbitMQ immediately.
>> > > > >>> > - Instead of publishing immediately, we should pass the
>> > > > >>> > messages to a new component, call it BatchQueueClass.
>> > > > >>> > - This new BatchQueueClass component would periodically check
>> > > > >>> > to see when we can unload jobs to submit.
>> > > > >>> >
>> > > > >>> > Adding BatchQueueClass:
>> > > > >>> > - Set up a new table (or tables) containing compute resource
>> > > > >>> > names and their corresponding queues' current job counts and
>> > > > >>> > maximum job limits.
>> > > > >>> > - Data models in Airavata have information on maximum job
>> > > > >>> > submission limits for a queue, but no data on how many jobs
>> > > > >>> > are currently running.
>> > > > >>> > - The current job count will effectively act as a counter:
>> > > > >>> > incremented when a job is submitted and decremented when a
>> > > > >>> > job is completed.
>> > > > >>> > - Once that is done, BatchQueueClass needs to periodically
>> > > > >>> > check the new table to see whether the user's requested
>> > > > >>> > queue's current job count is below the queue's job limit. If
>> > > > >>> > it is, we can pop jobs off and submit them until we hit the
>> > > > >>> > job limit; if not, we wait until we're under the job limit.
>> > > > >>> >
>> > > > >>> > How does this sound?
>> > > > >>> >
>> > > > >>> > Doug
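Doug's original proposal at the bottom of the thread (a per-queue counter plus a drain loop that pops pending jobs while under the limit) could be sketched roughly as below. This is a single-threaded, in-memory toy under loudly stated assumptions: all names are hypothetical, not actual Airavata classes; a real component would persist the counters (e.g. in the registry, per Doug's suggestion), run the drain on a timer, and still has to cope with the edge case raised at the top of the thread, where jobs submitted outside Airavata consume slots without ever producing completion events.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed BatchQueueClass throttling logic.
public class BatchQueueThrottle {

    private final Map<String, Integer> running = new HashMap<>();       // queueId -> current job count
    private final Map<String, Integer> limits = new HashMap<>();        // queueId -> max job limit
    private final Map<String, Deque<String>> pending = new HashMap<>(); // queueId -> held-back job ids

    public BatchQueueThrottle withLimit(String queueId, int maxJobs) {
        limits.put(queueId, maxJobs);
        return this;
    }

    /** Called instead of publishing to RabbitMQ directly from submit(). */
    public void submit(String queueId, String jobId) {
        pending.computeIfAbsent(queueId, q -> new ArrayDeque<>()).addLast(jobId);
        drain(queueId);
    }

    /** Called when a COMPLETED or FAILED event arrives for a tracked job. */
    public void onJobFinished(String queueId) {
        running.merge(queueId, -1, Integer::sum);  // decrement the counter
        drain(queueId);
    }

    /** Pop pending jobs and submit them while the queue is under its limit. */
    private void drain(String queueId) {
        Deque<String> q = pending.getOrDefault(queueId, new ArrayDeque<>());
        int limit = limits.getOrDefault(queueId, Integer.MAX_VALUE);
        while (!q.isEmpty() && running.getOrDefault(queueId, 0) < limit) {
            String jobId = q.pollFirst();
            running.merge(queueId, 1, Integer::sum);  // increment the counter
            publish(jobId);
        }
    }

    /** Stand-in for the RabbitMQ publish done in GFACPassiveJobSubmitter. */
    protected void publish(String jobId) {
        // real code would hand the job to the messaging layer here
    }

    public int runningCount(String queueId) { return running.getOrDefault(queueId, 0); }

    public int pendingCount(String queueId) {
        return pending.getOrDefault(queueId, new ArrayDeque<>()).size();
    }
}
```

With a limit of 2, submitting three jobs publishes two and holds one back; the held job is released as soon as a completion event frees a slot.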
