[jira] [Commented] (MAPREDUCE-4284) Allow setting yarn.nodemanager.delete.debug-delay-sec on a per-job basis

Ahmed Radwan (JIRA) Fri, 25 May 2012 16:42:24 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283821#comment-13283821
 ]


Ahmed Radwan commented on MAPREDUCE-4284:
-----------------------------------------

Thanks Arun, 

Let me add more details. I think it's not just the tasklogs, and this is why 
this property exists. We have seen cases where inspecting the contents of the 
containers' localized file directories and log directories was extremely 
useful in troubleshooting problems (e.g. an AM failing to start).

I think easily controlling this property is equally important in production 
clusters. Consider the following scenario:

* A job is failing on a production cluster.
* Tasklogs are not showing much, so the containers' files need to be 
inspected for any clues.
* This requires changing this configuration property (e.g. setting it to 1 
day) and restarting every NM in the cluster, which is expensive.
* Once the problem for this job is solved, these directories are still kept 
for every submitted job, which is an unneeded and expensive use of storage. To 
fix that, we need to change the property back and restart the NMs on all nodes 
again.

Also, thinking about this issue more: YARN is a general framework, and 
applications other than MapReduce need to be considered, along with their 
ability to hint to YARN to keep these files. So we can't generalize 
assumptions about information available through specific application services 
(e.g. the MapReduce JobHistoryServer). I think the new proposed property above 
can be generalized across applications (or the Application interface could be 
extended).

bq. Your proposal doesn't work because the NodeManager doesn't load jobConf of 
the container... this would require changes to ContainerManager protocol.

Yes, I only described how the new delay would be calculated. Communicating 
this new jobConf property to the DeletionService will require more changes, as 
you highlighted. The question here is whether the added benefit outweighs the 
effort of these extra changes. Thoughts?
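
For illustration only, here is a minimal sketch of how an effective deletion 
delay might be derived once a per-job value reaches the NM. This is not taken 
from any patch: the class name, the method, the "negative means unset" 
convention and the admin-side cap are all assumptions, and the sketch does not 
address how the value would be transported to the DeletionService.

```java
// Hypothetical sketch; none of these names exist in Hadoop.
public class PerJobDebugDelaySketch {

    /**
     * Combine the NM-wide delay (yarn.nodemanager.delete.debug-delay-sec)
     * with a per-job request.
     *
     * @param nmDelaySec     delay configured on the NodeManager, in seconds
     * @param jobDelaySec    delay requested by the job; negative means "not set" (assumption)
     * @param maxJobDelaySec upper bound an admin allows jobs to request (assumption)
     */
    static long effectiveDelaySec(long nmDelaySec, long jobDelaySec, long maxJobDelaySec) {
        if (jobDelaySec < 0) {
            // The job did not ask for anything: keep the NM-wide behaviour.
            return nmDelaySec;
        }
        // Never let a single job retain files longer than the admin cap.
        long capped = Math.min(jobDelaySec, maxJobDelaySec);
        // A job can extend the NM-wide delay but not shorten it.
        return Math.max(nmDelaySec, capped);
    }

    public static void main(String[] args) {
        long oneDay = 24L * 60 * 60;   // 86400 seconds
        long oneWeek = 7 * oneDay;     // 604800 seconds

        // NM default of 0 (delete immediately), job asks to keep files for one day.
        System.out.println(effectiveDelaySec(0, oneDay, oneWeek));
        // Job did not set anything; the NM-wide value applies.
        System.out.println(effectiveDelaySec(600, -1, oneWeek));
        // Job asks for more than the cap; the cap applies.
        System.out.println(effectiveDelaySec(0, 1_000_000, oneWeek));
    }
}
```

With a rule like this, only the failing job's directories would be retained, 
and no NM restart would be needed in the scenario above.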
                
> Allow setting yarn.nodemanager.delete.debug-delay-sec on a per-job basis
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4284
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4284
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Ahmed Radwan
>            Assignee: Ahmed Radwan
>
> The yarn.nodemanager.delete.debug-delay-sec property is helpful in debugging 
> jobs (inspecting container logs/local dirs after the job finishes). Currently 
> it is a nodemanager property and changing it requires restarting the 
> nodemanager. In a production cluster this can be a real problem. It is better 
> to have this property set on a per-job basis and not requiring the restart of 
> nodemanagers. 

