[
https://issues.apache.org/jira/browse/OOZIE-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rohini Palaniswamy updated OOZIE-1527:
--------------------------------------
Assignee: Purshotam Shah (was: Mona Chitnis)
Optimizations that we plan to do:
CoordMaterializeTriggerService
- The lookupInterval and the scheduling interval of
CoordMaterializeTriggerRunnable are currently the same setting. The lookup
interval selects jobs whose next materialized time falls within that interval.
Earlier we reduced it to 2 mins so that materialization runs frequently, but
that also means a job is only considered for materialization 2 mins before its
nominal time is reached. If a lot of coord jobs have to be materialized at once
(this happens especially on hour boundaries), they get delayed badly and are
only picked up for materialization after the nominal time has already passed,
causing SLA misses. That is why we want two separate settings for the lookup
and schedule intervals, so that we can schedule frequently and look up further
in advance. For example, lookupInterval can be set to 10 mins and the schedule
interval to 2 mins. This way we materialize 10 mins in advance instead of just
2 mins before the nominal time, which leaves some breathing room to meet the SLA.
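A minimal sketch of why a larger lookup interval helps, assuming a simplified
in-memory job list. The class, the Job type and the method name are
hypothetical and are not Oozie's actual code; only the idea of selecting jobs
whose next materialized time falls within the lookup interval comes from the
description above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class LookupIntervalSketch {
    // Hypothetical stand-in for a coordinator job row (not CoordinatorJobBean).
    static final class Job {
        final String id;
        final Instant nextMaterializedTime;

        Job(String id, Instant nextMaterializedTime) {
            this.id = id;
            this.nextMaterializedTime = nextMaterializedTime;
        }
    }

    // Returns the ids of jobs whose next materialized time is at or before
    // now + lookupInterval. How often this runs is a separate setting.
    static List<String> selectForMaterialization(
            List<Job> jobs, Instant now, Duration lookupInterval) {
        Instant cutoff = now.plus(lookupInterval);
        List<String> picked = new ArrayList<>();
        for (Job job : jobs) {
            if (!job.nextMaterializedTime.isAfter(cutoff)) {
                picked.add(job.id);
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2014-02-10T13:52:00Z");
        List<Job> jobs = List.of(
                new Job("hourly", Instant.parse("2014-02-10T14:00:00Z")),
                new Job("later", Instant.parse("2014-02-10T15:00:00Z")));

        // With a 2 min lookup the 14:00 job is not yet visible at 13:52.
        System.out.println(selectForMaterialization(jobs, now, Duration.ofMinutes(2)));
        // With a 10 min lookup it is picked up 8 mins before its nominal time.
        System.out.println(selectForMaterialization(jobs, now, Duration.ofMinutes(10)));
    }
}
```

The runnable could still be scheduled every 2 mins; only the cutoff passed to
the selection changes.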
materializeCoordJobs():
- Use QueryExecutor (avoids creating a new object every time and uses no
transaction for reads) and get rid of CoordJobsToBeMaterializedJPAExecutor and
CoordActionsActiveCountJPAExecutor (no longer needed in TriggerService with the
change to the coord jobs to be materialized query, but used in
CoordMaterializeTransitionXCommand).
- GET_COORD_JOBS_OLDER_THAN
  - Select only specific columns instead of the whole coord job to make
the query faster.
  - Change the query to only fetch coord jobs which have maxThrottling >
(select count(a) from CoordinatorActionBean a where a.jobId = :jobId AND
a.statusStr = 'WAITING'). This makes the query much cheaper and never fetches
any of the rogue coordinator jobs into materialization at all. OOZIE-1539 can
be reverted with this change.
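To illustrate the proposed throttling condition, here is a minimal in-memory
sketch of the same predicate (maxThrottling > number of WAITING actions). The
class and method names and the map-based inputs are hypothetical; in Oozie the
check would live in the named query itself rather than in Java code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ThrottleFilterSketch {
    // Keeps only jobs that still have room for more WAITING actions.
    // maxThrottling: job id -> throttle limit (hypothetical input shape).
    // waitingCounts: job id -> current number of WAITING actions.
    static List<String> eligibleJobs(
            Map<String, Integer> maxThrottling, Map<String, Integer> waitingCounts) {
        List<String> eligible = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : maxThrottling.entrySet()) {
            int waiting = waitingCounts.getOrDefault(entry.getKey(), 0);
            if (entry.getValue() > waiting) {
                eligible.add(entry.getKey());
            }
        }
        return eligible;
    }

    public static void main(String[] args) {
        // TreeMap gives a stable iteration order for the example.
        Map<String, Integer> maxThrottling =
                new TreeMap<>(Map.of("A", 12, "B", 2, "C", 1));
        Map<String, Integer> waitingCounts = Map.of("A", 3, "B", 2);

        // B is already at its throttle, so it is never fetched.
        System.out.println(eligibleJobs(maxThrottling, waitingCounts)); // prints [A, C]
    }
}
```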
CoordMaterializeTransitionXCommand
- In case of a catch-up job (determined from the last materialized nominal
time, the current time and the frequency), dynamically increase the
materializationWindow from 1 hr to the maximum possible based on the number of
waiting actions. For example, if the last nominal time is Feb 1 12:00, the
current time is Feb 10 14:00, numwaitingactions is 3, maxthrottle is 12 and the
frequency is 1 hr, then
materializationWindow = frequency * (maxthrottle - numwaitingactions) = 9 hr.
This is to address slow materialization of catch-up jobs.
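The window formula above can be sketched as follows. The class and method
names are hypothetical, and clamping the free slots at zero is an assumption
added for safety; the rule for detecting a catch-up job is not spelled out
above and is left out here.

```java
import java.time.Duration;

public class CatchupWindowSketch {
    // materializationWindow = frequency * (maxThrottle - numWaitingActions)
    static Duration materializationWindow(
            Duration frequency, int maxThrottle, int numWaitingActions) {
        // Assumption: never return a negative window if waiting >= throttle.
        int freeSlots = Math.max(0, maxThrottle - numWaitingActions);
        return frequency.multipliedBy(freeSlots);
    }

    public static void main(String[] args) {
        // Values from the example: frequency 1 hr, maxthrottle 12, 3 waiting.
        Duration window = materializationWindow(Duration.ofHours(1), 12, 3);
        System.out.println(window.toHours() + " hr"); // prints 9 hr
    }
}
```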
[~shwethags],
Do you see any of the above changes impacting the 1 min jobs that you
have? They should make things better, but we are asking in case there is
something we missed, as we don't have any experience running 1 min jobs.
> Fix scalability issues with coordinator materialization
> -------------------------------------------------------
>
> Key: OOZIE-1527
> URL: https://issues.apache.org/jira/browse/OOZIE-1527
> Project: Oozie
> Issue Type: Bug
> Components: coordinator
> Affects Versions: trunk
> Reporter: Mona Chitnis
> Assignee: Purshotam Shah
> Fix For: trunk
>
> Original Estimate: 96h
> Remaining Estimate: 96h
>
> In certain situations when there is a large number of coordinators in the
> system, they have been observed to create a huge backlog in materialization
> and to progress very slowly compared to what is expected. This patch can be
> seen as either a bug fix or an enhancement addressing the following points:
> 1. 'materialization.system.limit' leads to bringing in Coord jobs in LRU
> fashion, but some of them may already be maxing out at actions to materialize
> (= throttle), so fewer than 'limit' jobs may actually undergo
> materialization. This patch does a second iteration of loading jobs to be
> materialized, to reduce the backlog.
> 2. 'materialization.window' being 1 hour may work in most cases, but hourly
> jobs are seen to face significant slowdown at times, caused by a lot of other
> minute jobs getting materialized. Therefore, the window can be doubled (i.e.
> 2 hours) when the job is hourly/daily.
> 3. For hourly coordinators, it is consistently seen that materialization
> occurs only near the end of the hour, e.g. for an action whose nominal time
> is 2:00, the action creation time is 1:59; if the nominal time is 3:00, the
> creation time is 2:58, and so on. If the window is an hour into the future,
> this does not explain why materialization does not occur at some point in
> the middle of the preceding hour.