[jira] [Updated] (OOZIE-3181) High frequency LAST_ONLY coord. job with many past time actions kills Oozie server

Oleksandr Kalinin (JIRA) Mon, 12 Feb 2018 04:52:26 -0800

     [ 
https://issues.apache.org/jira/browse/OOZIE-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Oleksandr Kalinin updated OOZIE-3181:
-------------------------------------
    Description: 
A user who submits a high-frequency LAST_ONLY coordinator job with a start time 
in the past (intentionally or by mistake) triggers an enormous materialization 
loop for that job, potentially causing an OOM condition even with high heap 
settings.

The simplest example is:

coordStarts=2017-02-12T09:00Z
coordEnds=2019-02-12T09:00Z
coordFrequency=*/1 * * * *

<execution>LAST_ONLY</execution>

Since throttling parameters are ignored for LAST_ONLY jobs, this triggers 
unthrottled materialization of more than 500K actions that lie in the past, 
which causes severe memory pressure and eventually a GC overhead lockout.
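
For a sense of scale, the backlog in the example can be estimated with a few 
lines of Python. This is only a rough sketch: it assumes the cron expression 
`*/1 * * * *` behaves as a fixed one-minute interval, takes the timestamp of 
this message (2018-02-12 12:52:26 UTC) as "now", and counts the start time 
itself as a nominal time. The function name is made up for illustration and is 
not part of Oozie.

```python
from datetime import datetime, timedelta, timezone


def count_past_actions(start, now, frequency):
    """Count nominal times in [start, now] for a fixed-interval schedule."""
    if now < start:
        return 0
    # Whole intervals elapsed since start, plus one for the start time itself.
    return (now - start) // frequency + 1


start = datetime(2017, 2, 12, 9, 0, tzinfo=timezone.utc)
# Assumed "now": the send time of this message, converted to UTC.
now = datetime(2018, 2, 12, 12, 52, 26, tzinfo=timezone.utc)
backlog = count_past_actions(start, now, timedelta(minutes=1))
print(backlog)  # 525833
```

Under these assumptions a single submission already has roughly 525K nominal 
times behind it, which is consistent with the "more than 500K" figure above.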

At the same time, by definition all past actions will be skipped anyway, so the 
only value in materializing them seems to be the ability to view their SKIPPED 
status later. Is it really worth the risk?

Note: what makes this problem more severe is that recovery is not trivial on 
ZK-coordinated clusters. The write lock persists, which prevents the kill 
command from taking effect, and it also persists after a restart. To recover, 
the write lock has to be removed manually.

Looking at the materialization loop code, I believe the algorithm can be 
improved to prevent this issue.
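
One possible direction, sketched here in Python rather than in Oozie's Java 
code, is to compute the latest nominal time that is already due directly 
instead of walking through every past one. This is a hypothetical 
illustration, not the actual materialization code and not a proposed patch: it 
assumes a fixed-interval frequency, treats the end time as inclusive, and 
`latest_due_nominal_time` is an invented name.

```python
from datetime import datetime, timedelta, timezone


def latest_due_nominal_time(start, end, now, frequency):
    """Return the latest nominal time <= min(now, end), or None if none is due.

    Uses arithmetic instead of a loop, so the cost does not depend on
    how many nominal times lie in the past.
    """
    horizon = min(now, end)  # assumption: end time treated as inclusive
    if horizon < start:
        return None
    elapsed_intervals = (horizon - start) // frequency
    return start + elapsed_intervals * frequency


start = datetime(2017, 2, 12, 9, 0, tzinfo=timezone.utc)
end = datetime(2019, 2, 12, 9, 0, tzinfo=timezone.utc)
# Assumed "now": the send time of this message, converted to UTC.
now = datetime(2018, 2, 12, 12, 52, 26, tzinfo=timezone.utc)
latest = latest_due_nominal_time(start, end, now, timedelta(minutes=1))
print(latest)  # 2018-02-12 12:52:00+00:00
```

In this sketch only that one action would be a candidate to run under 
LAST_ONLY. Whether the earlier ones should still be recorded as SKIPPED is the 
trade-off questioned above.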

  was:
User submitting high frequency LAST_ONLY coordinator job for past time 
(intentionally or by mistake) triggers enormous materialization loop for that 
job and potentially OOM condition even with high heap settings.

Simplest example is:

coordStarts=2017-02-12T09:00Z
 coordEnds=2019-02-12T09:00Z
 coordFrequency=*/1 * * * *

<execution>LAST_ONLY</execution>

This triggers non throttled materialization of more than 500K actions lying in 
the past which causes severe memory pressure and eventual GC overhead lockout.

At the same time by definition all past actions will be skipped anyway, thus it 
seems that the only value in materializing them is ability to view SKIPPED 
status later. Is it really worth the risk?

Note : additional severity of this problem is that it's not trivial to recover 
it on ZK-coordinated clusters. Write lock will persist which will prevent kill 
command from taking desired effect, and that lock will persist also after 
restart. To recover, write lock has to be manually removed.

Looking at materialization loop code, I believe there is potential for 
algorithm improvement to prevent this issue.


> High frequency LAST_ONLY coord. job with many past time actions kills Oozie 
> server
> ----------------------------------------------------------------------------------
>
>                 Key: OOZIE-3181
>                 URL: https://issues.apache.org/jira/browse/OOZIE-3181
>             Project: Oozie
>          Issue Type: Bug
>    Affects Versions: 4.3.0
>            Reporter: Oleksandr Kalinin
>            Priority: Major
>
> A user who submits a high-frequency LAST_ONLY coordinator job with a start 
> time in the past (intentionally or by mistake) triggers an enormous 
> materialization loop for that job, potentially causing an OOM condition even 
> with high heap settings.
>
> The simplest example is:
>
> coordStarts=2017-02-12T09:00Z
> coordEnds=2019-02-12T09:00Z
> coordFrequency=*/1 * * * *
>
> <execution>LAST_ONLY</execution>
>
> Since throttling parameters are ignored for LAST_ONLY jobs, this triggers 
> unthrottled materialization of more than 500K actions that lie in the past, 
> which causes severe memory pressure and eventually a GC overhead lockout.
>
> At the same time, by definition all past actions will be skipped anyway, so 
> the only value in materializing them seems to be the ability to view their 
> SKIPPED status later. Is it really worth the risk?
>
> Note: what makes this problem more severe is that recovery is not trivial on 
> ZK-coordinated clusters. The write lock persists, which prevents the kill 
> command from taking effect, and it also persists after a restart. To recover, 
> the write lock has to be removed manually.
>
> Looking at the materialization loop code, I believe the algorithm can be 
> improved to prevent this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
