Hi Devs,

I am working on the queue-based job throttling implementation, and here is the related JIRA [1] ticket, which was created to track the implementation steps.
The following explains how job throttling is implemented for now. It applies only to compute resources that have batch queues defined; other resources are not throttled. There is a validator called JobCountValidator, which checks whether there is room to submit a new job and returns "true" or "false" accordingly. I am using ZooKeeper to track runtime data such as how many jobs have been submitted to a given host. With the current implementation, the job count is increased when the job is added to the monitoring queue and decreased when the job is removed from the monitoring queue.

I ran a few tests and this approach worked fine. But after running a load test at a high submission rate, I observed that it breaks down, because we do the validation in the orchestrator and the job count update in GFac. This is a race condition: the orchestrator can still pass the validation step even when we have already submitted the maximum allowed number of jobs to a resource, because the job count in ZooKeeper has not been updated yet.

Therefore, to fix this we need to do the job submission and the job count increase in the same place. A potential place is the SimpleOrchestratorImpl#launchExperiment method. WDYT?

Even then, since the validation and launch operations are invoked through two separate client calls, we still have that race condition. I have sent a separate mail about that.

Thanks,
Shameera.

--
Best Regards,
Shameera Rathnayaka.

email: shameera AT apache.org, shameerainfo AT gmail.com
Blog : http://shameerarathnayaka.blogspot.com/
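To make the race described in this mail concrete, here is a minimal in-memory sketch of doing the check and the increment as one atomic step, so that two concurrent submissions cannot both pass validation for the last free slot. The class JobSlotCounter and its methods are hypothetical and are not part of Airavata. The real count lives in ZooKeeper, where the same idea would presumably need a conditional (versioned) update instead of an AtomicInteger.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper for illustration only; not an Airavata class.
// It models the per-resource job count that the mail keeps in ZooKeeper.
class JobSlotCounter {
    private final int maxJobs;
    private final AtomicInteger submitted = new AtomicInteger(0);

    JobSlotCounter(int maxJobs) {
        this.maxJobs = maxJobs;
    }

    // Validate and increment in a single atomic step.
    // Returns true only if a slot was actually reserved for the caller.
    boolean tryAcquire() {
        while (true) {
            int current = submitted.get();
            if (current >= maxJobs) {
                return false; // queue limit reached, reject the job
            }
            // Succeeds only if no other thread changed the count meanwhile;
            // otherwise loop and re-check against the fresh value.
            if (submitted.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    // Called when a job leaves the monitoring queue; never goes below zero.
    void release() {
        submitted.updateAndGet(n -> n > 0 ? n - 1 : 0);
    }

    int current() {
        return submitted.get();
    }
}
```

A caller would invoke tryAcquire() immediately before submitting and release() when the job is removed from the monitoring queue. There is then no window between "validation passed" and "count updated", which is the window the load test exposed.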
