Oleksandr Kalinin updated OOZIE-3181:
    Summary: High frequency LAST_ONLY coord. job with many past-time actions 
kills Oozie server  (was: High frequency coord. job LAST_ONLY with many 
past-time actions kills Oozie server)

> High frequency LAST_ONLY coord. job with many past-time actions kills Oozie 
> server
> ----------------------------------------------------------------------------------
>                 Key: OOZIE-3181
>                 URL: https://issues.apache.org/jira/browse/OOZIE-3181
>             Project: Oozie
>          Issue Type: Bug
>    Affects Versions: 4.3.0
>            Reporter: Oleksandr Kalinin
>            Priority: Major
> A user who submits a high-frequency LAST_ONLY coordinator job with a start
> time in the past (intentionally or by mistake) triggers an enormous
> materialization loop for that job and, potentially, an OOM condition even
> with a large heap.
> The simplest example is:
> coordStarts=2017-02-12T09:00Z
> coordEnds=2019-02-12T09:00Z
> coordFrequency=*/1 * * * *
> <execution>LAST_ONLY</execution>
> This triggers unthrottled materialization of more than 500K actions that lie
> in the past, which causes severe memory pressure and an eventual GC overhead
> lockout.
> At the same time, all past actions will by definition be skipped anyway, so
> the only value in materializing them seems to be the ability to view their
> SKIPPED status later. Is it really worth the risk?
> Note: this problem is even more severe for stability because it is not
> trivial to recover from on ZK-coordinated clusters. The write lock persists,
> which prevents the kill command from taking effect, and it also persists
> after a restart. To recover, the write lock has to be removed manually.
> Looking at the materialization loop code, I believe the algorithm can be
> improved to prevent this issue.
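
One possible direction for such an improvement, as a minimal sketch only: for
LAST_ONLY, the latest past nominal time can be computed arithmetically rather
than by materializing every past action. The code below is not Oozie code. The
function name last_past_action, the fixed one-minute timedelta standing in for
the cron expression */1 * * * *, the inclusive treatment of the end time, and
the chosen "now" value are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def last_past_action(start, end, freq, now):
    """Return (index, nominal_time) of the latest action at or before `now`.

    Hypothetical helper, not part of Oozie. Returns None when no action is
    due yet. The result is computed arithmetically, so the cost does not
    depend on how many nominal times lie in the past.
    """
    # Simplification: the end time is treated as inclusive here.
    horizon = min(now, end)
    if horizon < start:
        return None
    index = (horizon - start) // freq  # timedelta // timedelta yields an int
    return index, start + index * freq


# Values from the example in the report; `now` is an assumed current time.
start = datetime(2017, 2, 12, 9, 0, tzinfo=timezone.utc)
end = datetime(2019, 2, 12, 9, 0, tzinfo=timezone.utc)
freq = timedelta(minutes=1)
now = datetime(2018, 2, 12, 9, 0, 30, tzinfo=timezone.utc)

index, nominal = last_past_action(start, end, freq, now)
print(index)    # 525600 earlier actions that LAST_ONLY would skip
print(nominal)  # 2018-02-12 09:00:00+00:00
```

With these assumed values, one year of one-minute actions lies in the past,
which is consistent with the "more than 500K" figure above. A real fix would
also have to handle cron-based frequencies, time zones and DST, and decide
whether SKIPPED records for the past actions are still wanted.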

This message was sent by Atlassian JIRA