[
https://issues.apache.org/jira/browse/MAPREDUCE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770121#action_12770121
]
Todd Lipcon commented on MAPREDUCE-967:
---------------------------------------
One note about this JIRA: it will also need a fix for Streaming. The
common way people ship scripts for streaming is the "-file foo.py"
argument, which just includes foo.py in the job jar and assumes it will be
unpacked on the other side. With this patch, those files won't be unpacked,
which breaks the primary use case of the -file argument.
Two options to fix this issue:
# We could change -file to use DistributedCache instead. The fact that -file
and -files do different things is confusing in the first place, but changing
the behavior is potentially a breaking change, I think.
# We could change Streaming to add all of the -file paths to the new
configuration parameter such that the existing behavior is preserved.
If no one else has a preference I'll go for option #2 above.
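A rough sketch of what option #2 might look like, assuming the new
configuration parameter holds a regular expression matched against jar entry
names, that classes/ and lib/ are unpacked by default, and that -file entries
sit at the root of the job jar. None of these details are confirmed above, and
the parameter name and the buildUnpackPattern helper are hypothetical:

```java
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class UnpackPatternSketch {
    // Hypothetical key: the comment above does not name the new parameter.
    static final String HYPOTHETICAL_UNPACK_KEY = "example.job.jar.unpack.pattern";

    // Build a pattern that matches the assumed default unpack locations
    // plus the base name of every path passed via -file.
    static Pattern buildUnpackPattern(List<String> fileArgs) {
        List<String> alternatives = new ArrayList<>();
        alternatives.add("classes/.*"); // assumed default
        alternatives.add("lib/.*");     // assumed default
        for (String arg : fileArgs) {
            // Assumption: -file entries are stored under their base name
            // at the root of the job jar.
            String baseName = Paths.get(arg).getFileName().toString();
            // Quote the name so characters like '.' match literally.
            alternatives.add(Pattern.quote(baseName));
        }
        return Pattern.compile("(?:" + String.join("|", alternatives) + ")");
    }

    public static void main(String[] args) {
        Pattern p = buildUnpackPattern(Arrays.asList("scripts/foo.py"));
        // Streaming would set this value in the job configuration.
        System.out.println(HYPOTHETICAL_UNPACK_KEY + "=" + p.pattern());
    }
}
```

With a pattern like this, the TaskTracker would still skip the bulk of a large
job jar while continuing to unpack the scripts shipped with -file.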
> TaskTracker does not need to fully unjar job jars
> -------------------------------------------------
>
> Key: MAPREDUCE-967
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-967
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: tasktracker
> Affects Versions: 0.21.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Attachments: mapreduce-967-branch-0.20.txt
>
>
> In practice we have seen some users submitting job jars that consist of
> 10,000+ classes. Unpacking these jars into mapred.local.dir and then cleaning
> up after them has a significant cost (both in wall clock and in unnecessary
> heavy disk utilization). This cost can be easily avoided