[
https://issues.apache.org/jira/browse/OOZIE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522026#comment-14522026
]
Bowen Zhang commented on OOZIE-2223:
------------------------------------
[~ben.roling], thank you for bringing this up. The Oozie documentation
currently does not cover every aspect of Oozie usage, and some of the topics
it does cover are not specific enough. We encourage all users and developers
to help us improve the documentation.
> Improve documentation with regard to Java action retries
> --------------------------------------------------------
>
> Key: OOZIE-2223
> URL: https://issues.apache.org/jira/browse/OOZIE-2223
> Project: Oozie
> Issue Type: Improvement
> Components: docs
> Affects Versions: 3.3.2, 4.1.0, 4.0.1
> Reporter: Ben Roling
> Fix For: trunk
>
> Attachments: OOZIE-2223-2.patch
>
>
> My organization has been bitten by a mistake in the way we have written Java
> action applications. I would like to introduce a documentation change that
> might reduce the likelihood that others new to Oozie make the same mistake.
> The mistake is failing to account for the possibility that launcher tasks
> will fail for reasons such as cluster maintenance. We have a number of jobs
> that take input and output paths as arguments. Our code had been
> deliberately written so that the job fails if the output path already
> exists, to avoid inadvertently deleting an output that may have been
> consumed by a downstream job.
> This has bitten us during cluster maintenance that requires TaskTracker
> restarts. During such an event, any launcher running on a TaskTracker when
> it restarts fails and is retried on another TaskTracker. The new attempt of
> the launcher task then fails because the output directory already exists.
> This in turn fails the whole workflow. Maintenance that requires restarting
> all TaskTrackers can end up causing a lot of workflow failures.
> The current documentation does hint at such issues via mention of the
> “prepare” block, but I don’t think the explanation of this block is clear
> enough for newcomers to understand its use. Furthermore, I’m not sure the
> prepare block is the best answer for how to handle the specific types of
> issues I am referring to. A “delete” action in a prepare block will delete
> content regardless of its state, so a previously completed, good output
> could be deleted. This can lead to issues such as broken traceability when
> there is a need to trace an output back to the inputs that produced it.
> I believe a more appropriate way to handle the possibility of launcher task
> failure is to write the action so that it reuses a previously completed
> output without deleting or reprocessing it. Only if the action detects an
> incomplete output does it delete that output and re-run the processing to
> produce it. This protects against accidental output destruction.
> Furthermore, some types of actions spawn activity that runs asynchronously
> outside the context of the launcher task itself. In such cases the action
> author must take care to clean up any stray activity spawned prior to the
> failure of the initial launcher task to ensure it does not collide with
> activity produced by the new attempt of the launcher. In the case of my
> organization, such activity includes child M/R jobs spawned from the Apache
> Crunch pipelines we invoke from our Java actions. Depending on the design of
> the action, it may be necessary to find and kill such child jobs before
> invoking the new pipeline and spawning new child jobs.
> I will attach a patch that demonstrates one possible documentation
> improvement to shed light on these issues, but I would appreciate feedback
> and any other ideas.
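
The retry-safe approach described in the issue (reuse a complete output,
delete and re-run only when the output is incomplete) might be sketched as
follows. This is a minimal illustration, not Oozie or Hadoop API: it uses the
local filesystem via java.nio.file as a stand-in for HDFS, and the names
RetrySafeOutput, Job, runIfNeeded, and the use of a _SUCCESS marker file to
signal completeness are assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** Sketch only: a local-filesystem stand-in for an HDFS output directory. */
final class RetrySafeOutput {

    /** Assumed completion marker, written only after processing succeeds. */
    static final String MARKER = "_SUCCESS";

    /** Hypothetical callback that performs the actual processing. */
    interface Job {
        void writeTo(Path outputDir) throws IOException;
    }

    /**
     * Reuses a complete output; otherwise removes any partial output and
     * re-runs the job. Returns true if the job ran, false if output was reused.
     */
    static boolean runIfNeeded(Path outputDir, Job job) throws IOException {
        if (Files.exists(outputDir.resolve(MARKER))) {
            return false; // a previous attempt finished: keep its output untouched
        }
        if (Files.exists(outputDir)) {
            deleteRecursively(outputDir); // leftover from a failed attempt
        }
        Files.createDirectories(outputDir);
        job.writeTo(outputDir);
        Files.createFile(outputDir.resolve(MARKER)); // mark complete last
        return true;
    }

    /** Deletes a directory tree, children before parents. */
    private static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```

A Java action's main method could call RetrySafeOutput.runIfNeeded(outputDir,
dir -> ...) so that a retried launcher attempt either reuses the finished
output or cleans up and starts over. The sketch does not deal with child jobs
still running from the failed attempt; those would need to be found and
killed separately, as the description notes.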