Oleksandr Kalinin created OOZIE-3181:
----------------------------------------

             Summary: High frequency coord. job LAST_ONLY with many past-time actions kills Oozie server
                 Key: OOZIE-3181
                 URL: https://issues.apache.org/jira/browse/OOZIE-3181
             Project: Oozie
          Issue Type: Bug
    Affects Versions: 4.3.0
            Reporter: Oleksandr Kalinin


A user who submits a high-frequency coordinator job whose start time lies in 
the past (intentionally or by mistake) triggers an enormous materialization 
loop for that job and, potentially, an OOM condition even with high heap 
settings.

The simplest example is:

coordStarts=2017-02-12T09:00Z
coordEnds=2019-02-12T09:00Z
coordFrequency=*/1 * * * *

<execution>LAST_ONLY</execution>

This triggers unthrottled materialization of more than 500K actions that lie 
in the past, which causes severe memory pressure and, eventually, a GC 
overhead lockout.
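
To make the figure concrete, here is a rough back-of-the-envelope sketch of 
the arithmetic. It is not Oozie code: count_past_actions is a hypothetical 
helper, and the value of "now" (exactly one year after coordStarts) is an 
assumption chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone


def count_past_actions(start, now, frequency=timedelta(minutes=1)):
    """Count nominal times in [start, now) for a fixed-frequency job.

    Hypothetical helper for illustration only; not part of Oozie.
    """
    if now <= start:
        return 0
    # Ceiling division: a partially elapsed period still has a nominal time.
    return -(-(now - start) // frequency)


# coordStarts from the example above.
start = datetime(2017, 2, 12, 9, 0, tzinfo=timezone.utc)
# Assumed submission time: exactly one year after coordStarts.
now = datetime(2018, 2, 12, 9, 0, tzinfo=timezone.utc)

print(count_past_actions(start, now))  # 525600 (365 days * 1440 minutes)
```

With a one-minute frequency, every additional year of backlog adds roughly 
another half million actions to materialize.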

At the same time, by definition all past actions will be skipped anyway, so it 
seems that the only value in materializing them is the ability to view their 
SKIPPED status later. Is it really worth the risk?
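
As a minimal sketch of the alternative, assuming it is acceptable to lose the 
SKIPPED records: the materialization start could be fast-forwarded to the 
latest nominal time that is not after the current time. The function name 
fast_forward_start and the fixed-interval frequency are assumptions for 
illustration; this is not how Oozie currently behaves.

```python
from datetime import datetime, timedelta, timezone


def fast_forward_start(start, now, frequency):
    """Return the latest nominal time <= now, or start if the job is in the future.

    Hypothetical helper: it skips the past backlog arithmetically instead of
    materializing one action per past nominal time.
    """
    if now <= start:
        return start
    elapsed_periods = (now - start) // frequency  # whole periods already elapsed
    return start + elapsed_periods * frequency


start = datetime(2017, 2, 12, 9, 0, tzinfo=timezone.utc)
now = datetime(2018, 2, 12, 9, 0, 30, tzinfo=timezone.utc)  # assumed "now"
print(fast_forward_start(start, now, timedelta(minutes=1)))
# 2018-02-12 09:00:00+00:00
```

This is a constant-time computation regardless of how far in the past the 
start time lies.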

Note: this problem is even more severe in terms of stability because it is not 
trivial to recover from on ZK-coordinated clusters. The write lock persists, 
which prevents the kill command from taking effect, and the lock also survives 
a restart. To recover, the write lock has to be removed manually.

Looking at the materialization loop code, I believe there is potential for 
algorithm and throttling improvements that would prevent this issue.
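
One possible shape of such throttling, sketched in Python rather than taken 
from the Oozie code base: produce nominal times lazily and hand them out in 
bounded batches, so that at most max_batch actions are held in memory at any 
time. The names nominal_times, materialize_in_batches and max_batch are 
hypothetical.

```python
from datetime import datetime, timedelta, timezone
from itertools import islice


def nominal_times(start, end, frequency):
    """Lazily yield nominal times in [start, end)."""
    t = start
    while t < end:
        yield t
        t += frequency


def materialize_in_batches(start, end, now, frequency, max_batch):
    """Yield lists of at most max_batch nominal times, up to min(end, now).

    Illustrative only: a real implementation would persist each batch and
    release its lock before moving on to the next one.
    """
    horizon = min(end, now)
    times = nominal_times(start, horizon, frequency)
    while True:
        batch = list(islice(times, max_batch))
        if not batch:
            return
        yield batch


start = datetime(2017, 2, 12, 9, 0, tzinfo=timezone.utc)
end = datetime(2019, 2, 12, 9, 0, tzinfo=timezone.utc)
now = start + timedelta(minutes=10)  # assumed "now", kept small for the demo
sizes = [len(b) for b in materialize_in_batches(start, end, now,
                                                timedelta(minutes=1), 4)]
print(sizes)  # [4, 4, 2]
```

Because the generator never builds the full list of past actions, memory use 
stays bounded by max_batch instead of growing with the size of the backlog.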



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
