Hi Lahiru,

On Tue, Sep 23, 2014 at 1:38 PM, Lahiru Gunathilake <[email protected]> wrote:

> It's wrong to update the count before a successful job submission
> (the submission might finally fail, and then it is not the actual
> count in the queue), and even if we do it in the same place there
> will always be a race condition.

Can't we say that if the jobSubmitter.submit(..) method returns "true", the
job has been submitted to the compute resource without any issue? If we
can, then increasing the job count after the submit operation would solve
our issue to some extent (yes, I can see it is hard to completely fix the
race condition).

> If we want to really fix this we have to implement a queue-based
> approach where GFac will pick jobs from a worker queue and, if the count
> is exceeded, we delay the job submission.

Are you suggesting moving the scheduling part to GFac instead of doing it
in the Orchestrator? And is this a global queue that every GFac node can
access, or a queue per GFac node?

> On Tue, Sep 23, 2014 at 1:04 PM, Shameera Rathnayaka <
> [email protected]> wrote:
>
>> Hi Devs,
>>
>> I am working on the queue-based job throttling implementation, and here
>> is the related JIRA [1] ticket, which was created to track the
>> implementation steps.
>>
>> The following explains how job throttling is implemented for now. It
>> applies only to compute resources that have batch queues defined;
>> otherwise it does not apply.
>>
>> There is a validator called JobCountValidator. This validator checks
>> whether there is enough space to submit a new job and returns "true" or
>> "false" accordingly. I am using ZooKeeper to track runtime data such as
>> how many jobs have been submitted to a given host. With the current
>> implementation, the job count is increased when the job is added to the
>> monitoring queue and decreased when the job is removed from the
>> monitoring queue. I ran a few tests and this approach is working fine.
>> But after I ran a load test at a high rate, I observed that this
>> approach does not work, because we do the validation in the Orchestrator
>> and the job count update in GFac. This is due to a race condition: the
>> Orchestrator can still pass the validation step even when we have
>> already submitted the maximum allowed number of jobs to a resource but
>> have not yet updated the job count in ZooKeeper. Therefore we need to do
>> the job submission and the job count increase in the same place to fix
>> that.
>>
>> So a potential place is the SimpleOrchestratorImpl#launchExperiment
>> method. WDYT?
>>
>> As the validation and launch operations are invoked through two
>> separate client calls, we still have that race condition. I have sent
>> a separate mail about that.
>>
>> Thanks,
>> Shameera.
>>
>> --
>> Best Regards,
>> Shameera Rathnayaka.
>>
>> email: shameera AT apache.org , shameerainfo AT gmail.com
>> Blog : http://shameerarathnayaka.blogspot.com/
>
> --
> Research Assistant
> Science Gateways Group
> Indiana University

--
Best Regards,
Shameera Rathnayaka.

email: shameera AT apache.org , shameerainfo AT gmail.com
Blog : http://shameerarathnayaka.blogspot.com/
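
PS: To make the "check and increase in the same place" idea concrete, here
is a minimal sketch, not actual Airavata code. It assumes a hypothetical
JobSlotThrottle class, and an in-memory AtomicInteger stands in for the
per-host job count that we keep in ZooKeeper. The slot is reserved
atomically before the submit call and released again if the submission
fails or throws, so the count is never left inflated by a failed job and
two callers cannot both pass the check for the last free slot.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical throttle; the counter is an in-memory stand-in for the ZooKeeper job count. */
class JobSlotThrottle {
    private final int maxJobs;
    private final AtomicInteger activeJobs = new AtomicInteger(0);

    JobSlotThrottle(int maxJobs) {
        this.maxJobs = maxJobs;
    }

    /** Claims one slot. The limit check and the increment happen as a single atomic step. */
    boolean tryReserve() {
        while (true) {
            int current = activeJobs.get();
            if (current >= maxJobs) {
                return false; // queue limit reached, caller should delay or reject
            }
            if (activeJobs.compareAndSet(current, current + 1)) {
                return true; // nobody changed the count between our read and write
            }
            // Another thread updated the count first; re-read and retry.
        }
    }

    /** Gives a slot back, e.g. when the job leaves the monitoring queue. */
    void release() {
        activeJobs.decrementAndGet();
    }

    /** Reserves a slot, runs the submit action, and rolls back if it fails or throws. */
    boolean submitWithThrottle(BooleanSupplier submit) {
        if (!tryReserve()) {
            return false;
        }
        boolean submitted = false;
        try {
            submitted = submit.getAsBoolean();
        } finally {
            if (!submitted) {
                release(); // failed submissions must not count against the limit
            }
        }
        return submitted;
    }

    int activeCount() {
        return activeJobs.get();
    }
}
```

The submit argument is a placeholder for whatever wraps
jobSubmitter.submit(..). With the real counter in ZooKeeper, the
compare-and-set step would presumably become a versioned write (setData
with the expected version, retried on a version conflict), but I have not
tried that yet.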
