Hi Lahiru,

I was able to resolve this by moving the job throttle logic to the
launchExperiment method and synchronizing the jobSubmitter.submit call and
the job count update. This introduces a small performance bottleneck; if we
can tolerate that bottleneck in the job submission phase, this will work
without an issue as long as we have one Orchestrator in our deployment.
WDYT? Can we go with this and change it to a better approach later?
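
Roughly what I have in mind (a minimal sketch only; jobCountManager, its
ZooKeeper-backed getJobCount/incrementJobCount methods, and the submit
arguments are illustrative assumptions, not the actual Airavata API):

    // sketch of the per-host throttle inside SimpleOrchestratorImpl#launchExperiment
    private final Object jobSubmissionLock = new Object();

    private boolean submitWithThrottle(String hostId, String experimentId, String taskId) throws Exception {
        synchronized (jobSubmissionLock) {
            int runningJobs = jobCountManager.getJobCount(hostId);    // current count from ZooKeeper
            if (runningJobs >= maxAllowedJobs) {
                return false;                                         // throttle: resource is full, don't submit
            }
            boolean submitted = jobSubmitter.submit(experimentId, taskId);
            if (submitted) {
                jobCountManager.incrementJobCount(hostId);            // update count in the same critical section
            }
            return submitted;
        }
    }

The synchronized block is what causes the bottleneck mentioned above, since
submissions through one Orchestrator get serialized.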

Thanks,
Shameera.

On Tue, Sep 23, 2014 at 2:06 PM, Shameera Rathnayaka <[email protected]> wrote:

> Hi Lahiru,
>
> On Tue, Sep 23, 2014 at 1:38 PM, Lahiru Gunathilake <[email protected]>
> wrote:
>
>> It's wrong to update the count before a successful job submission
>> (because the submission might still fail, in which case the count would
>> not reflect what is actually in the queue), and even if we do both in the
>> same place there will always be a race condition.
>>
>
> Can't we say that if the jobSubmitter.submit(..) method returns "true", the
> job has been submitted to the compute resource without any issue? If we can,
> then increasing the job count after the submit operation would solve our
> issue to some extent (yes, I can see it is hard to completely fix the race
> condition).
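>
> Just to make the idea concrete, something like this on the GFac side
> (hypothetical names; assumes the ZooKeeper-backed count is updated right
> after the submit call rather than when the job enters the monitoring queue):
>
>     boolean submitted = jobSubmitter.submit(jobExecutionContext);  // assumed argument
>     if (submitted) {
>         // count only jobs that actually reached the compute resource
>         zkJobCountStore.increment(hostId);
>     }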
>
>
>> If we want to really fix this, we have to implement a queue-based approach
>> where GFac picks jobs from a worker queue, and if the count is exceeded we
>> delay the job submission.
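>>
>> A rough sketch of that worker loop (hypothetical classes; the shared queue,
>> jobCountReader, and the retry scheduling are assumptions about the design,
>> not existing code):
>>
>>     // GFac-side worker: pull jobs from the shared work queue
>>     while (running) {
>>         JobRequest job = workerQueue.take();                 // blocks until a job is available
>>         if (jobCountReader.getJobCount(job.getHostId()) >= maxJobs) {
>>             // resource is full: put the job back after a delay instead of submitting now
>>             scheduler.schedule(() -> workerQueue.offer(job), RETRY_DELAY_SECONDS, TimeUnit.SECONDS);
>>             continue;
>>         }
>>         submitAndIncrementCount(job);                        // submit + count update done together
>>     }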
>>
>
> Are you suggesting moving the scheduling part to GFac instead of doing it
> in the Orchestrator? And is this a global queue that every GFac node can
> access, or a queue per GFac node?
>
>
>>
>>
>>
>> On Tue, Sep 23, 2014 at 1:04 PM, Shameera Rathnayaka <
>> [email protected]> wrote:
>>
>>> Hi Devs,
>>>
>>> I am working on a queue-based job throttling implementation, and here is
>>> the related JIRA [1] ticket created to track the implementation steps.
>>>
>>> The following explains how job throttling has been implemented so far.
>>> This only applies to compute resources that have batch queues defined for
>>> them; otherwise it does not.
>>>
>>> There is a validator called JobCountValidator; it checks whether there is
>>> enough space to submit a new job and returns "true" or "false" accordingly.
>>> I am using ZooKeeper to track runtime data such as how many jobs have been
>>> submitted to a given host. With the current implementation, the job count
>>> is increased when the job is added to the monitoring queue and decreased
>>> when the job is removed from the monitoring queue. I ran a few tests and
>>> this approach works fine. But after running a load test at a high rate, I
>>> observed that it breaks down, because we do the validation in the
>>> Orchestrator and the job count update in GFac. This is a race condition:
>>> the Orchestrator can still pass the validation step even when the allowed
>>> maximum number of jobs has already been submitted to a resource but the
>>> job count in ZooKeeper has not yet been updated. Therefore we need to do
>>> the job submission and the job count increase in the same place to fix this.
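>>>
>>> For reference, the validation side currently looks roughly like this
>>> (simplified sketch; the znode path layout and variable names are
>>> illustrative, not the exact code):
>>>
>>>     // JobCountValidator: runs in the Orchestrator before the experiment is launched
>>>     String countPath = "/airavata/job-counts/" + hostId;        // assumed znode layout
>>>     byte[] data = zkClient.getData(countPath, false, null);     // plain ZooKeeper read
>>>     int currentJobs = Integer.parseInt(new String(data));
>>>     return currentJobs < maxAllowedJobs;                        // the increment happens later, in GFac
>>>
>>> Because the increment happens later in GFac, two experiments can read the
>>> same count and both pass validation, which is the race described above.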
>>>
>>> So the potential place is the SimpleOrchestratorImpl#launchExperiment
>>> method. WDYT?
>>>
>>> Since the validation and launch operations are invoked through two
>>> separate client calls, we still have that race condition; I have sent a
>>> separate mail about that.
>>>
>>> Thanks,
>>> Shameera.
>>>
>>> --
>>> Best Regards,
>>> Shameera Rathnayaka.
>>>
>>> email: shameera AT apache.org , shameerainfo AT gmail.com
>>> Blog : http://shameerarathnayaka.blogspot.com/
>>>
>>
>>
>>
>> --
>> Research Assistant
>> Science Gateways Group
>> Indiana University
>>
>
>
>
> --
> Best Regards,
> Shameera Rathnayaka.
>
> email: shameera AT apache.org , shameerainfo AT gmail.com
> Blog : http://shameerarathnayaka.blogspot.com/
>



-- 
Best Regards,
Shameera Rathnayaka.

email: shameera AT apache.org , shameerainfo AT gmail.com
Blog : http://shameerarathnayaka.blogspot.com/
