Hi Jacques,

I'm working on implementing the priority queue approach at the moment for a client. All going well, it will be in production in a couple of weeks and I'll report back then with a patch.
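In case it helps the discussion in the meantime, the rough shape of what I'm experimenting with is below. It's only a sketch using plain JDK classes and illustrative names (none of it is the actual OFBiz JobPoller code): a Comparable wrapper so that low-priority work such as purge jobs sorts to the back of a PriorityBlockingQueue, with a sequence number as a tie-breaker to keep FIFO ordering within a priority.

    import java.util.concurrent.PriorityBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    public class PriorityQueueSketch {

        /**
         * Wraps a job with an explicit priority so that a PriorityBlockingQueue
         * drains high-priority work first. Lower number = higher priority, so a
         * purge job would be given a large value and sort to the back.
         */
        static class PrioritizedJob implements Runnable, Comparable<PrioritizedJob> {
            private static final AtomicLong SEQUENCE = new AtomicLong();

            private final int priority;
            private final long order = SEQUENCE.getAndIncrement(); // FIFO tie-breaker within a priority
            private final Runnable delegate;

            PrioritizedJob(int priority, Runnable delegate) {
                this.priority = priority;
                this.delegate = delegate;
            }

            @Override
            public void run() {
                delegate.run();
            }

            @Override
            public int compareTo(PrioritizedJob other) {
                int byPriority = Integer.compare(this.priority, other.priority);
                return byPriority != 0 ? byPriority : Long.compare(this.order, other.order);
            }
        }

        public static void main(String[] args) {
            // The PriorityBlockingQueue is unbounded, so the poller still has to
            // limit how many jobs it offers to the executor (see the poll()
            // discussion further down the thread).
            ThreadPoolExecutor executor = new ThreadPoolExecutor(
                    2, 2, 60, TimeUnit.SECONDS, new PriorityBlockingQueue<>());

            // Whatever lands in the queue is drained in priority order, so a purge
            // job can never hold back user-facing work that arrives after it.
            executor.execute(new PrioritizedJob(100, () -> System.out.println("purge job")));
            executor.execute(new PrioritizedJob(1, () -> System.out.println("async data import")));
            executor.execute(new PrioritizedJob(50, () -> System.out.println("regular job")));

            executor.shutdown();
        }
    }

The unbounded nature of PriorityBlockingQueue is still the awkward part, which is why the poll limit change discussed below matters.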
Regards,
Scott

On Tue, 26 Feb 2019 at 03:11, Jacques Le Roux <jacques.le.r...@les7arts.com> wrote:

> Hi,
>
> I put this comment there with OFBIZ-10002, trying to document why we have 5
> as the hardcoded value of the /max-threads/ attribute in the /thread-pool/
> element (serviceengine.xml). At that time Scott had already mentioned[1]:
>
> /Honestly I think the topic is generic enough that OFBiz doesn't need to
> provide any information at all. Thread pool sizing is not exclusive to
> OFBiz and it would be strange for anyone to modify the numbers without
> first researching sources that provide far more detail than a few
> sentences in our config files will ever cover./
>
> I agree with Scott and Jacopo that jobs are more likely I/O bound than CPU
> bound. So I agree that we should take that into account, change the current
> algorithm and remove this somewhat misleading comment. Scott's suggestion in
> his 2nd email sounds good to me. If I understood it well, we could use an
> unbounded but still effectively limited queue, as it was before:
>
>     Although with all of that said, after a quick second look it appears that
>     the current implementation doesn't try to poll for more jobs than the
>     configured limit (minus already queued jobs) so we might be fine with an
>     unbounded queue implementation. We'd just need to alter the call to
>     JobManager.poll(int limit) to not pass in
>     executor.getQueue().remainingCapacity() and instead pass in something like
>     (threadPool.getJobs() - executor.getQueue().size())
>
> I'm fine with that, as it would continue to prevent hitting physical
> limitations and can be tweaked by users as it is now. Note though that it
> seems hard to tweak, as we have already received several "complaints" about
> it.
>
> Now, one of the advantages of a PriorityBlockingQueue is priority. To take
> advantage of that we can't rely on "/natural ordering/" and need to
> implement Comparable (which does not seem easy). Nicolas provided some leads
> below and this should be discussed. Ideally that would be parametrised, of
> course.
>
> My 2 cts
>
> [1] https://markmail.org/message/ixzluzd44rgloa2j
>
> Jacques
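Just to make the suggestion quoted above a bit more concrete, the limit computation would be something along these lines. It's a sketch using only plain JDK types and an illustrative class/method name, not the actual JobPoller change:

    import java.util.concurrent.ThreadPoolExecutor;

    final class PollLimitSketch {
        /**
         * With an unbounded queue, remainingCapacity() is no longer meaningful
         * (PriorityBlockingQueue always returns Integer.MAX_VALUE), so instead
         * cap each poll at the configured job limit (threadPool.getJobs() in the
         * quote above) minus whatever is already sitting in the executor's queue.
         */
        static int pollLimit(ThreadPoolExecutor executor, int configuredJobLimit) {
            int alreadyQueued = executor.getQueue().size();
            return Math.max(0, configuredJobLimit - alreadyQueued);
        }
    }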
> On 06/02/2019 at 14:24, Nicolas Malin wrote:
> > Hello Scott,
> >
> > On a customer project we use the job manager massively, with an average of
> > one hundred thousand jobs per day.
> >
> > We have different cases: huge long-running jobs, async persistent jobs, and
> > fast regular jobs. The main problem we detected has been (as you noted) the
> > long jobs that block the poller's threads, and when we restart OFBiz (we
> > are on continuous delivery) we had no window to do this without crashing
> > some jobs.
> >
> > To address this, we tried with Gil to analyze whether we could add some
> > weighting to the job definition, to help the job manager decide which jobs
> > from the pending queue it can push to the queued queue. We changed our
> > approach to create two pools: one for system maintenance and huge long
> > jobs, managed by two OFBiz instances, and another for user activity jobs,
> > also managed by two instances. We also added to the service definition an
> > indication of the preferred pool.
> >
> > This isn't a big deal and doesn't resolve the stuck pool, but none of the
> > blocked jobs are vital for daily activity.
> >
> > For crashed jobs, we introduced in trunk a service lock that we set before
> > an update, and we then wait for a window for the restart.
> >
> > Currently, for every OOM detected we reanalyse the originating job and try
> > to decompose it into persistent async services to help spread the load.
> >
> > If I had more time, I would orient job improvements towards:
> >
> > * Defining an execution plan rule to link services and pollers without
> > touching any service definition
> >
> > * Defining per-instance configuration for the job vacuum, to refine it by
> > service volume
> >
> > This feedback is a little confused, Scott, but maybe you'll find something
> > interesting in it.
> >
> > Nicolas
> >
> > On 30/01/2019 20:47, Scott Gray wrote:
> >> Hi folks,
> >>
> >> Just jotting down some issues with the JobManager noticed over the last
> >> few days:
> >> 1. min-threads in serviceengine.xml is never exceeded unless the job count
> >> in the queue exceeds 5000 (or whatever is configured). Is this not obvious
> >> to anyone else? I don't think this was the behavior prior to a refactoring
> >> a few years ago.
> >> 2. The advice on the number of threads to use doesn't seem good to me, it
> >> assumes your jobs are CPU bound when in my experience they are more likely
> >> to be I/O bound while making db or external API calls, sending emails etc.
> >> With the default setup, it only takes two long running jobs to effectively
> >> block the processing of any others until the queue hits 5000 and the other
> >> threads are finally opened up. If you're not quickly maxing out the queue
> >> then any other jobs are stuck until the slow jobs finally complete.
> >> 3. Purging old jobs doesn't seem to be well implemented to me, from what
> >> I've seen the system is only capable of clearing a few hundred per minute,
> >> and if you've filled the queue with them then regular jobs have to queue
> >> behind them and can take many minutes to finally be executed.
> >>
> >> I'm wondering if anyone has experimented with reducing the queue size?
> >> I'm considering reducing it to say 100 jobs per thread (along with
> >> increasing the thread count). In theory it would reduce the time real jobs
> >> have to sit behind PurgeJobs and would also open up additional threads for
> >> use earlier.
> >>
> >> Alternatively I've pondered trying a PriorityBlockingQueue for the job
> >> queue (unfortunately the implementation is unbounded though, so it isn't a
> >> drop-in replacement) so that PurgeJobs always sit at the back of the
> >> queue. It might also allow prioritizing certain "user facing" jobs (such
> >> as asynchronous data imports) over lower priority, less time critical
> >> jobs. Maybe another option (or in conjunction) is some sort of "swim-lane"
> >> queue/executor that allocates jobs to threads based on prior execution
> >> speed, so that slow running jobs can never use up all threads and block
> >> faster jobs.
> >>
> >> Any thoughts/experiences you have to share would be appreciated.
> >>
> >> Thanks
> >> Scott
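PS: for anyone who wants to see point 1 of my original email in isolation, below is a small standalone example (plain JDK, nothing OFBiz-specific, numbers picked just for the demo) showing that a ThreadPoolExecutor only grows past its core pool size once the work queue is full, which is why min-threads is effectively the ceiling until the queue hits its configured limit:

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class CorePoolDemo {
        public static void main(String[] args) {
            // min-threads / max-threads equivalents, with a deliberately tiny
            // bounded queue so the effect is visible without 5000 jobs.
            ThreadPoolExecutor pool = new ThreadPoolExecutor(
                    1, 5, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>(3));

            // 1 task runs on the single core thread, 3 more fill the queue...
            for (int i = 0; i < 4; i++) {
                pool.execute(CorePoolDemo::slowTask);
            }
            // ...and the pool still has only 1 thread, even though max is 5.
            System.out.println("threads before the queue is full: " + pool.getPoolSize());

            // Only once the queue rejects an offer does the pool grow past core size.
            pool.execute(CorePoolDemo::slowTask);
            System.out.println("threads after the queue is full:  " + pool.getPoolSize());

            pool.shutdown();
        }

        private static void slowTask() {
            try {
                Thread.sleep(2000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }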