[ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202496#comment-13202496 ]

Robert Joseph Evans commented on MAPREDUCE-3824:
------------------------------------------------

I like the concept of the patch.  Volatile is definitely needed here, my bad on 
that one.  I also like that you are doing a DU to update the size of the cached 
objects when their recorded size is 0.  I do have some issues with the patch though.
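
Just to be concrete about the volatile point, a minimal sketch of the visibility 
issue (CacheStatusSketch here is a simplified stand-in, not the real CacheStatus):

{code:java}
class CacheStatusSketch {
  // Written by the DU update thread, read by the cleanup thread.  Without
  // volatile the cleanup thread could keep seeing the stale value of 0 even
  // after the DU has finished.
  volatile long size;
}
{code}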

The first is that even though the DU size update is being done on a separate 
thread, it is being done with the cachedArchives lock held.  The amount of time 
it takes to do a DU could be significant, and nothing new can be added to the 
cache while the cachedArchives lock is held, so it could block other new tasks 
from making progress.  I would really prefer to see this done in two passes, 
similar to how we delete entries (a rough sketch is below).  The first pass 
would go through all entries and identify those that need to be updated; the 
second pass would update those entries without the lock held.  Once all of the 
entries are updated we can look at cleaning up the distributed cache.
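
Roughly what I have in mind, as a sketch only.  CacheEntry, cachedArchives and 
the diskUsage walk below are simplified stand-ins for the real CacheStatus map 
and DU logic, not the actual TaskTracker code:

{code:java}
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CacheEntry {
  final File localizedDir;
  volatile long size;            // volatile so the cleanup thread sees updates
  CacheEntry(File dir) { this.localizedDir = dir; }
}

class CacheSizeUpdater {
  private final Map<String, CacheEntry> cachedArchives = new HashMap<>();

  void updateStaleSizes() {
    // Pass 1: under the lock, only collect the entries whose size is unknown.
    List<CacheEntry> toMeasure = new ArrayList<>();
    synchronized (cachedArchives) {
      for (CacheEntry e : cachedArchives.values()) {
        if (e.size == 0) {
          toMeasure.add(e);
        }
      }
    }
    // Pass 2: run the (potentially slow) DU with no lock held, so new
    // localizations are not blocked while we walk the directories.
    for (CacheEntry e : toMeasure) {
      e.size = diskUsage(e.localizedDir);
    }
  }

  // Stand-in for the real DU; here just a recursive directory walk.
  private static long diskUsage(File f) {
    if (f.isFile()) {
      return f.length();
    }
    long total = 0;
    File[] children = f.listFiles();
    if (children != null) {
      for (File c : children) {
        total += diskUsage(c);
      }
    }
    return total;
  }
}
{code}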

The second is that we are updating the size too late.  We decide how much space 
needs to be deleted to get us back under the desired amount based entirely on 
the size reported by BaseDirManager, which in turn gets its data from the 
CacheStatus object.  The issue is that in the current patch we first calculate 
how much needs to be removed, then we update the size of the archives, and then 
we delete them.  This is fairly minor, because it just means the entries would 
be deleted in the next pass, but it would also be covered by doing the update 
in two passes, with the sizes refreshed before the cleanup calculation (see the 
sketch below).
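
In other words, the ordering I would expect is: refresh the sizes first, then 
compute how much has to go.  A rough, self-contained sketch of that ordering; 
allowedCacheSize, Entry and the eviction step are hypothetical stand-ins for 
the real BaseDirManager logic:

{code:java}
import java.util.ArrayList;
import java.util.List;

class BaseDirCleanupSketch {
  static class Entry {
    volatile long size;            // refreshed by the DU update
    boolean inUse;                 // entries in use are never evicted
  }

  private final long allowedCacheSize;   // hypothetical configured limit
  private final List<Entry> entries = new ArrayList<>();

  BaseDirCleanupSketch(long allowedCacheSize) {
    this.allowedCacheSize = allowedCacheSize;
  }

  // Refresh sizes BEFORE deciding how much must be removed, so the eviction
  // target is not computed from stale (possibly zero) sizes.
  void checkAndCleanup() {
    refreshSizes();
    long used = 0;
    for (Entry e : entries) {
      used += e.size;
    }
    long excess = used - allowedCacheSize;
    if (excess > 0) {
      evict(excess);
    }
  }

  private void refreshSizes() {
    // Placeholder for the two-pass DU update sketched above.
  }

  private void evict(long bytesToFree) {
    long freed = 0;
    List<Entry> removed = new ArrayList<>();
    for (Entry e : entries) {
      if (freed >= bytesToFree) {
        break;
      }
      if (!e.inUse) {
        freed += e.size;
        removed.add(e);            // the real code would also delete the files
      }
    }
    entries.removeAll(removed);
  }
}
{code}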

I am not sure exactly which situations cause the size not to be set.  I would 
like to know exactly which situations the current code is missing, because, as 
I said previously, the code that computes the used size goes entirely off of 
what is reported to BaseDirManager.  Unfortunately, there are some issues with 
BaseDirManager where, if we are too aggressive with setting the size, we might 
double count some archives, which would eventually make BaseDirManager think it 
is full all the time, which would be very bad.

                
> Distributed caches are not removed properly
> -------------------------------------------
>
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.txt
>
>
> Distributed caches are not being properly removed by the TaskTracker when 
> they are expected to be expired. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
