Re: Job throttling implementation clarification.

Shameera Rathnayaka Tue, 23 Sep 2014 11:07:25 -0700

Hi Lahiru,

On Tue, Sep 23, 2014 at 1:38 PM, Lahiru Gunathilake <[email protected]>
wrote:


> It's wrong to update the count before a successful job submission
> (because the submission might ultimately fail, and then the count is
> not the actual count in the queue), and even if we do it in the same
> place there will always be a race condition.
>

Can't we say that if the jobSubmitter.submit(..) method returns "true", the
job has been submitted to the compute resource without any issue? If we
can, then increasing the job count after the submit operation would solve
our issue to some extent (yes, I can see it is hard to completely fix the
race condition).
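
A minimal sketch of what I mean, assuming a hypothetical JobSubmitter
interface whose submit(..) returns "true" on success, and an in-memory
AtomicInteger standing in for the count we keep in ZooKeeper. None of these
names are the actual Airavata API. The count is increased only after a
successful submit, so a failed submission never inflates it, but the check
and the increment are still separate steps, so the race is narrowed rather
than removed.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SubmitThenCount {

    /** Hypothetical stand-in for the real job submitter; returns true on success. */
    interface JobSubmitter {
        boolean submit(String jobId);
    }

    // In-memory stand-in for the per-host job count kept in ZooKeeper.
    private final AtomicInteger jobCount = new AtomicInteger();
    private final int maxJobs;
    private final JobSubmitter submitter;

    SubmitThenCount(int maxJobs, JobSubmitter submitter) {
        this.maxJobs = maxJobs;
        this.submitter = submitter;
    }

    /** Returns true only if the job was submitted and counted. */
    boolean trySubmit(String jobId) {
        // Validation step. It is not atomic with the increment below, so two
        // concurrent callers can both pass it when only one slot is left.
        if (jobCount.get() >= maxJobs) {
            return false;
        }
        if (!submitter.submit(jobId)) {
            return false; // failed submission: the count stays untouched
        }
        jobCount.incrementAndGet();
        return true;
    }

    int currentCount() {
        return jobCount.get();
    }

    public static void main(String[] args) {
        // Illustrative submitter that fails for ids starting with "bad".
        SubmitThenCount throttle = new SubmitThenCount(2, id -> !id.startsWith("bad"));
        System.out.println(throttle.trySubmit("job-1"));
        System.out.println(throttle.trySubmit("bad-job"));
        System.out.println(throttle.trySubmit("job-2"));
        System.out.println(throttle.trySubmit("job-3"));
        System.out.println("count = " + throttle.currentCount());
    }
}
```

With a limit of 2, the failed "bad-job" does not consume a slot, and
"job-3" is rejected once two jobs have been counted.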


> If we really want to fix this, we have to implement a queue-based approach
> where GFAC will pick jobs from a worker queue, and if the count is
> exceeded we delay the job submission.
>

Are you suggesting moving the scheduling part to GFac instead of doing it
in the Orchestrator? And is this a global queue that every GFac node can
access, or one queue per GFac node?
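
To make sure I understand the queue-based approach, here is a rough sketch
of a single in-process queue. The class and method names are hypothetical,
and it deliberately leaves the global vs. per-node question open. Jobs are
never rejected, only delayed, and the limit check and the slot update happen
under one lock, so there is no gap between validation and count update.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class ThrottledWorkerQueue {

    private final Queue<String> pending = new ArrayDeque<>();
    private final int maxRunning;
    private int running = 0;

    ThrottledWorkerQueue(int maxRunning) {
        this.maxRunning = maxRunning;
    }

    /** Adds a job to the worker queue; it waits there until a slot is free. */
    synchronized void enqueue(String jobId) {
        pending.add(jobId);
    }

    /**
     * Picks pending jobs in FIFO order while slots are free and returns the
     * ids picked in this pass. Real code would submit each id to the compute
     * resource here; this sketch only does the bookkeeping.
     */
    synchronized List<String> drain() {
        List<String> picked = new ArrayList<>();
        while (running < maxRunning && !pending.isEmpty()) {
            picked.add(pending.poll());
            running++;
        }
        return picked;
    }

    /** Called when monitoring reports that a job has finished; frees one slot. */
    synchronized void jobFinished() {
        if (running > 0) {
            running--;
        }
    }

    synchronized int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        ThrottledWorkerQueue queue = new ThrottledWorkerQueue(2);
        queue.enqueue("job-1");
        queue.enqueue("job-2");
        queue.enqueue("job-3");
        System.out.println("picked: " + queue.drain());
        System.out.println("delayed: " + queue.pendingCount());
        queue.jobFinished(); // one running job completes
        System.out.println("picked: " + queue.drain());
    }
}
```

If the queue has to be shared by every GFac node, the lock and the counter
would have to live in something like ZooKeeper instead of in one JVM.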



>
>
>
> On Tue, Sep 23, 2014 at 1:04 PM, Shameera Rathnayaka <
> [email protected]> wrote:
>
>> Hi Devs,
>>
>> I am working on the queue-based job throttling implementation, and here
>> is the related JIRA [1] ticket, which was created to track the
>> implementation steps.
>>
>> The following explains how job throttling is implemented for now. It
>> applies only to compute resources that have batch queues defined;
>> otherwise it does not apply.
>>
>> There is a validator called JobCountValidator. It checks whether there
>> is enough room to submit a new job and returns "true" or "false"
>> accordingly. I am using ZooKeeper to track runtime data such as how many
>> jobs have been submitted to a given host. With the current
>> implementation, the job count is increased when the job is added to the
>> monitoring queue and decreased when the job is removed from the
>> monitoring queue. I ran a few tests and this approach works fine. But
>> after I ran a load test at a high rate, I observed that this approach
>> does not work, because we do the validation in the Orchestrator and the
>> job count update in GFac. This is due to a race condition: the
>> Orchestrator can still pass the validation step even when we have
>> already submitted the maximum allowed number of jobs to a resource but
>> have not yet updated the job count in ZooKeeper. Therefore we need to do
>> the job submission and the job count increase in the same place to fix
>> that.
>>
>> So a potential place is the SimpleOrchestratorImpl#launchExperiment
>> method. WDYT?
>>
>> As the validation and launch operations are invoked through two separate
>> client calls, we still have that race condition. I have sent a separate
>> mail about that.
>>
>> Thanks,
>> Shameera.
>>
>> --
>> Best Regards,
>> Shameera Rathnayaka.
>>
>> email: shameera AT apache.org , shameerainfo AT gmail.com
>> Blog : http://shameerarathnayaka.blogspot.com/
>>
>
>
>
> --
> Research Assistant
> Science Gateways Group
> Indiana University
>



-- 
Best Regards,
Shameera Rathnayaka.

email: shameera AT apache.org , shameerainfo AT gmail.com
Blog : http://shameerarathnayaka.blogspot.com/
