[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202612#comment-13202612
 ] 

Allen Wittenauer commented on MAPREDUCE-3824:
---------------------------------------------

There is no doubt the patch is a hack, but it solved my immediate problems 
because as it stands, distributed caches are really broken at scale.

Some background.  I have a team of users that has several 36GB distributed 
caches. When these caches are in play, most of the system is basically locked 
while they get built.  This patch was really geared towards making sure 
that these massive caches at least get deleted.  Without this patch in 
place, the mapred tmp space fills and tasks fail, eventually leading to the 
collapse of the mapred framework. 

There are a lot of other problems that show up with caches this large:
* Hadoop doesn't have a size limit check on caches as part of the job 
submission process [So any hand waving about "don't use caches that big!" is 
null and void since there is no way to actually stop a user from doing that!]
* The setup and cleanup tasks also trigger cache downloads.
* TaskTrackers appear to be frozen for *all* tasks during cache downloads, with 
the task stuck in the extremely unhelpful "unassigned" state.
* Updating the private cache as a separate step seems unnecessary given the 
permissions at the file system level.
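
To illustrate the first point, a submission-time size check could be as small 
as the sketch below. This is not Hadoop code and not part of the attached 
patch: CacheSizeGuard, totalBytes, checkWithinLimit, and maxBytes are 
hypothetical names, and the sketch reads local files through java.nio rather 
than HDFS paths.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical pre-submission guard; illustrative only, not a Hadoop class. */
final class CacheSizeGuard {
    private CacheSizeGuard() {}

    /** Sums the sizes, in bytes, of the given cache files. */
    public static long totalBytes(List<Path> cacheFiles) throws IOException {
        long total = 0L;
        for (Path p : cacheFiles) {
            total += Files.size(p);
        }
        return total;
    }

    /**
     * Rejects the job if the combined cache size exceeds maxBytes.
     * maxBytes stands in for a site-configured limit (an assumption;
     * no such setting is described in the issue).
     */
    public static void checkWithinLimit(List<Path> cacheFiles, long maxBytes)
            throws IOException {
        long total = totalBytes(cacheFiles);
        if (total > maxBytes) {
            throw new IllegalArgumentException(
                "Distributed cache is " + total + " bytes; limit is " + maxBytes);
        }
    }
}
```

Failing at submission time, before any TaskTracker starts a download, is the 
point of the design: the user gets an immediate error instead of a cluster 
that fills its tmp space.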

What really needs to happen is a massive overhaul of the entire distributed 
cache system.  But that's a bigger project, preferably for someone who gets 
paid to do Hadoop development full time.  So, as with all of the patches I've 
been submitting lately, I'm not expecting this one to get committed. But it is 
enough of a patch for someone who needs a usable system until a working 
release ships.
                
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when 
> they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
