[
https://issues.apache.org/jira/browse/OOZIE-348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101904#comment-13101904
]
Hadoop QA commented on OOZIE-348:
---------------------------------
tucu00 remarked:
It seems to me that the requeueing logic is not correct: it should not alter
the order, but simply ignore the duplicate queueing, leaving the original
entry in its existing place in the queue.
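The ignore-duplicate semantics described above could be sketched roughly as follows (a minimal illustration, not Oozie's actual CallableQueueService; the class and method names are made up for this sketch):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Sketch of a queue that ignores duplicate offers, leaving the original
// element in its existing position instead of moving it to the tail.
public class DedupQueue<E> {
    private final Deque<E> queue = new ArrayDeque<>();
    private final Set<E> members = new HashSet<>();

    // Returns false and leaves the queue untouched if e is already queued.
    public synchronized boolean offer(E e) {
        if (members.contains(e)) {
            return false;
        }
        members.add(e);
        queue.addLast(e);
        return true;
    }

    public synchronized E poll() {
        E e = queue.pollFirst();
        if (e != null) {
            members.remove(e);
        }
        return e;
    }

    public synchronized int size() {
        return queue.size();
    }
}
```

With this shape, a second queueing of the same command is a no-op, so the command keeps its original place in line rather than being pushed to the back.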
A default threadpool size of 120 is a bit too high for a default value; that
should be a site configuration value. The optimum size of the threadpool is
determined by the load on your system and the hardware/OS resources you have.
IMO, a database would be overkill. I would not replace the existing in-memory
solution with a DB solution; rather, I'd leverage the fact that services are
pluggable and offer a DB solution as well. Still, I'd suggest you test your
current load with a DB solution.
Regarding the comment that a DB approach would be good for a hot-hot setup:
load distribution for an in-memory solution could easily be handled by having
each instance process only the IDs that satisfy JOBID MOD
${LIVE_OOZIE_INSTANCES} == ${OOZIE_INSTANCE_ID}. The number of live instances
and the instance ID would be dynamically generated/stored in ZooKeeper (which
would be needed anyway to provide distributed lock support).
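That modulo-based ownership rule could look like the sketch below. The class name and constructor are illustrative only; in a real deployment the live-instance count and the instance ID would be read from ZooKeeper rather than passed in as plain parameters:

```java
// Sketch of the JOBID MOD ${LIVE_OOZIE_INSTANCES} == ${OOZIE_INSTANCE_ID}
// ownership check; liveInstances and instanceId would come from ZooKeeper.
public class InstancePartitioner {
    private final int liveInstances;
    private final int instanceId;

    public InstancePartitioner(int liveInstances, int instanceId) {
        this.liveInstances = liveInstances;
        this.instanceId = instanceId;
    }

    // An instance processes a job only if it "owns" the job's ID.
    public boolean owns(String jobId) {
        // Math.floorMod keeps the result non-negative even when
        // hashCode() returns a negative value.
        return Math.floorMod(jobId.hashCode(), liveInstances) == instanceId;
    }
}
```

For any given job ID, exactly one of the live instances owns it, so no job is processed twice and no job is dropped, as long as all instances agree on the same count and ID assignment from ZooKeeper.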
> GH-561: Redesign oozie internal queue
> -------------------------------------
>
> Key: OOZIE-348
> URL: https://issues.apache.org/jira/browse/OOZIE-348
> Project: Oozie
> Issue Type: Bug
> Reporter: Hadoop QA
>
> We have had a lot of issues related to the Oozie internal queue, including
> queue overflow as well as re-queuing of the same heavily used commands to
> avoid starvation. There are other situations too. These problems become very
> obvious under very high load.
> I would like to open up the discussion to find a better long-term
> architectural design that accounts for very high-load situations.
> The following proposals, ranging from a complete overhaul to adjustments of
> the current design, are meant to initiate the discussion:
> 1. Move the queue into the DB:
> Pros: Persistence. Useful in hot-hot or load-balancing situations. A single
> source of truth. Different levels of ordering could be applied as needed
> through SQL. No need to worry about queue size. No need to rebuild the queue
> on every restart -- the recovery service might be less busy.
> Cons: Extra DB access overhead.
> A middle approach could be to keep an in-memory cache with strict
> conditions. The details could be discussed later.
> 2. Re-queuing of the same commands (used for throttling) should be
> redesigned. In this case, make sure re-queuing happens in the *same* place --
> not at the end of the queue. I know this breaks the usual queue semantics,
> so we might need to use a different data structure.
> Currently, queuing the same command at the end creates a starvation
> (live-lock) like situation.
> 3. Multiple queues, e.g. a separate one for the coordinator input check,
> which is used 99% of the time.
> Comments?
> Regards,
> Mohammad
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira