[ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202612#comment-13202612 ]
Allen Wittenauer commented on MAPREDUCE-3824:
---------------------------------------------
There is no doubt the patch is a hack, but it solved my immediate problems
because as it stands, distributed caches are really broken at scale.
Some background: I have a team of users who have several 36GB distributed
caches. When these caches are in play, most of the system is basically locked
while these caches get built. This patch was really geared towards making sure
that these massive caches at least get deleted. Without these patches in
place, the mapred tmp spaces fill and tasks fail, eventually leading to mapred
framework collapse.
There are a lot of other problems that show up with caches this large:
* Hadoop doesn't have a size limit check on caches as part of the job
submission process. [So any hand waving about "don't use caches that big!" is
null and void, since there is no way to actually stop a user from doing that!]
* the setup and cleanup tasks also trigger cache downloads.
* tasktrackers appear to be frozen for *all* tasks during cache downloads, with
the task stuck in the extremely unhelpful "unassigned" state.
* the methodology of updating the private cache as a different step seems
unnecessary given the permissions at the file system level.
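To make the first point concrete, here is a rough sketch of what a submission-time size check could look like. It is plain Java and does not use any Hadoop API: the class name CacheSizeCheck, both method names, and the idea of passing in a list of file sizes plus a configured limit are all hypothetical, and none of this comes from the attached patch.

```java
import java.util.List;

// Hypothetical sketch only: not a Hadoop class and not part of the attached patch.
// Assumes the caller already knows the size in bytes of every cache file and has
// some configured upper bound (maxBytes) to compare against.
final class CacheSizeCheck {
    private CacheSizeCheck() {}

    /** Sums the cache file sizes, rejecting negative sizes and long overflow. */
    static long totalBytes(List<Long> fileSizes) {
        long total = 0L;
        for (long size : fileSizes) {
            if (size < 0) {
                throw new IllegalArgumentException("negative file size: " + size);
            }
            // addExact throws ArithmeticException instead of silently wrapping.
            total = Math.addExact(total, size);
        }
        return total;
    }

    /** Throws IllegalStateException if the combined cache size exceeds maxBytes. */
    static void checkWithinLimit(List<Long> fileSizes, long maxBytes) {
        long total = totalBytes(fileSizes);
        if (total > maxBytes) {
            throw new IllegalStateException(
                "distributed cache is " + total + " bytes, limit is " + maxBytes);
        }
    }
}
```

Failing the job at submission with a message like this would at least give an admin something to point at, rather than finding out when the tmp space fills.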
What really needs to happen is a massive overhaul of the entire distributed
cache system. But that's a bigger project, preferably for someone who gets
paid to do hadoop development full time. So, like all of the patches I've been
submitting lately, I'm not expecting them to get committed. But this is enough
of a patch for someone who needs a usable system until a working release ships.
> Distributed caches are not removed properly
> -------------------------------------------
>
> Key: MAPREDUCE-3824
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: distributed-cache
> Affects Versions: 1.0.0
> Reporter: Allen Wittenauer
> Priority: Critical
> Attachments: MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when
> they are expected to be expired.