[ https://issues.apache.org/jira/browse/MAPREDUCE-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283821#comment-13283821 ]
Ahmed Radwan commented on MAPREDUCE-4284:
-----------------------------------------
Thanks Arun,
Let me add more details. This property exists for more than just the tasklogs.
We have seen cases where inspecting the contents of the containers' localized
file directories and log directories was extremely useful in troubleshooting
problems (e.g. AM failure-to-start issues).
I think easily controlling this property is equally important in production
clusters. Consider the following scenario:
* A job is failing on a production cluster.
* Tasklogs are not showing much, and the containers' files need to be inspected
for any clues.
* This configuration property now has to be changed (e.g. set to 1 day) and
every NM in the cluster restarted (note how expensive this is).
* The problem for this job is solved, but now these directories are kept for
every submitted job, which wastes storage unnecessarily. To solve that, we need
to change the property back and restart the NMs on all nodes again.
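For concreteness, here is a small sketch of the value involved in the "1 day"
step above. The property takes seconds, so one day is 86400. The class and
method names are only illustrative; the property key is the one this issue is
about, and it would normally be set in yarn-site.xml on each NM.

```java
import java.util.concurrent.TimeUnit;

// Illustrative helper only; not part of Hadoop.
class DebugDelayValue {
    // NodeManager property discussed in this issue; its value is in seconds.
    static final String KEY = "yarn.nodemanager.delete.debug-delay-sec";

    // Converts a retention period in days to the seconds value the property expects.
    static long daysToSeconds(long days) {
        return TimeUnit.DAYS.toSeconds(days);
    }

    public static void main(String[] args) {
        // Prints: yarn.nodemanager.delete.debug-delay-sec=86400
        System.out.println(KEY + "=" + daysToSeconds(1));
    }
}
```

Every NM has to pick up this value through a restart, once to enable it and
once more to revert it.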
Also, thinking about this issue more: YARN is a general framework, and
applications other than MapReduce need to be considered, along with their
ability to hint to YARN to keep these files. So we can't generalize assumptions
about information available through specific application services (e.g. the
MapReduce JobHistoryServer). I think the new proposed property above can be
generalized across applications (or the Application interface could be
extended).
bq. Your proposal doesn't work because the NodeManager doesn't load jobConf of
the container... this would require changes to ContainerManager protocol.
Yes, I only described how the new delay would be calculated; communicating this
new jobConf property to the DeletionService will require more changes, as you
highlighted. The question here is whether the added benefit outweighs the
effort of these extra changes. Thoughts?
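To make the trade-off easier to discuss, here is a minimal sketch of one
possible way an effective delay could be derived. It is an assumption for
illustration, not necessarily the rule proposed earlier in this thread: a
per-job value, when present, overrides the NM-wide value. The per-job property
name is hypothetical, plain maps stand in for the NM and job configurations,
and the sketch does not address how the job value would actually reach the NM
(the ContainerManager protocol change mentioned above).

```java
import java.util.Map;

// Illustrative sketch only; not Hadoop code and not the DeletionService API.
class EffectiveDebugDelay {
    // Real NodeManager-wide property named in this issue (seconds).
    static final String NM_KEY = "yarn.nodemanager.delete.debug-delay-sec";
    // HYPOTHETICAL per-job property name, invented for this sketch.
    static final String JOB_KEY = "example.job.delete.debug-delay-sec";

    // ASSUMED rule: use the per-job value if the job sets one,
    // otherwise fall back to the NM-wide value (treated as 0 when unset).
    static long effectiveDelaySec(Map<String, String> nmConf,
                                  Map<String, String> jobConf) {
        long nmDelay = Long.parseLong(nmConf.getOrDefault(NM_KEY, "0").trim());
        String jobValue = jobConf.get(JOB_KEY);
        if (jobValue == null) {
            return nmDelay;
        }
        // Clamp negative job values to 0 so a bad hint cannot produce a negative delay.
        return Math.max(0L, Long.parseLong(jobValue.trim()));
    }

    public static void main(String[] args) {
        Map<String, String> nmConf = Map.of(NM_KEY, "0");
        Map<String, String> debugJob = Map.of(JOB_KEY, "86400");
        Map<String, String> normalJob = Map.of();

        System.out.println(effectiveDelaySec(nmConf, debugJob));  // prints 86400
        System.out.println(effectiveDelaySec(nmConf, normalJob)); // prints 0
    }
}
```

Under this assumed rule only the job being debugged keeps its directories, and
no NM restart is needed; other rules (e.g. taking the larger of the two values)
are equally possible.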
> Allow setting yarn.nodemanager.delete.debug-delay-sec on a per-job basis
> ------------------------------------------------------------------------
>
> Key: MAPREDUCE-4284
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4284
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Reporter: Ahmed Radwan
> Assignee: Ahmed Radwan
>
> The yarn.nodemanager.delete.debug-delay-sec property is helpful in debugging
> jobs (inspecting container logs/local dirs after the job finishes). Currently
> it is a nodemanager property and changing it requires restarting the
> nodemanager. In a production cluster this can be a real problem. It is better
> to have this property set on a per-job basis and not requiring the restart of
> nodemanagers.