[ https://issues.apache.org/jira/browse/MAPREDUCE-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768731#action_12768731 ]

Hemanth Yamijala commented on MAPREDUCE-1098:
---------------------------------------------

Arun, this may not work either.

Basically, the localization code is like this:

{code}
synchronized (cachedArchives) {
  // get the lcacheStatus object for this cache file
  synchronized (lcacheStatus) {
    // increment the reference count
  }
}
synchronized (lcacheStatus) {
  // localize the cache (the costly DFS download)
}
{code}

The delete cache code is like this:

{code}
synchronized (cachedArchives) {
  for each lcacheStatus {
    synchronized (lcacheStatus) {
      if (lcacheStatus.refCount == 0) {
        // delete the localized files
      }
    }
  }
}
{code}

The problem is that while iterating to delete, if a localizing thread is localizing a 
cache file for a particular cache object, the delete thread will wait to acquire the 
lock on that cache object *after* it has already acquired the global lock. Since the 
localization could take a long time, all other threads will be blocked on the global 
lock, which in effect fails to solve the very problem we are trying to fix.
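To illustrate the blocking, here is a toy demonstration. This is not the TaskTracker code; all names are made up. The cleanup thread takes the global lock first, then stalls waiting out a slow "download", and a third thread ends up stuck on the global lock:

{code}
class LockOrderingDemo {
  static final Object cachedArchives = new Object();   // stands in for the global map lock
  static final Object lcacheStatus = new Object();     // stands in for one cache entry's lock

  public static void main(String[] args) throws InterruptedException {
    Thread localizer = new Thread(() -> {
      synchronized (lcacheStatus) {
        sleepQuietly(5000);                             // the long DFS download
      }
    });
    Thread cleanup = new Thread(() -> {
      synchronized (cachedArchives) {                   // global lock taken first...
        synchronized (lcacheStatus) { }                 // ...then blocks here until the "download" ends
      }
    });
    Thread otherLocalizer = new Thread(() -> {
      synchronized (cachedArchives) {                   // stuck behind the cleanup thread
        System.out.println("finally got the global lock");
      }
    });
    localizer.start();
    Thread.sleep(100);                                  // let the localizer grab its lock first
    cleanup.start();
    Thread.sleep(100);
    otherLocalizer.start();
  }

  static void sleepQuietly(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
  }
}
{code}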

Does this make sense?

It seems like a truly correct approach should *not* require holding any lock while 
doing a costly operation like a DFS download. Other threads should instead wait for 
a download-complete event notification, or something along those lines. But those 
are sweeping changes. 
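If it helps, here is a rough sketch of what that event-based idea could look like, using a CountDownLatch per cache entry. The names (CacheEntry, EventBasedCache, downloadFromDfs) are made up for illustration and are not the actual DistributedCache code:

{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

class CacheEntry {
  final CountDownLatch localized = new CountDownLatch(1);
  volatile String localPath;                        // published once the download finishes
}

class EventBasedCache {
  private final ConcurrentHashMap<String, CacheEntry> entries =
      new ConcurrentHashMap<String, CacheEntry>();

  String getLocalCache(String uri) throws InterruptedException {
    CacheEntry fresh = new CacheEntry();
    CacheEntry existing = entries.putIfAbsent(uri, fresh);
    if (existing == null) {
      // This thread won the race: do the long download *without* holding any lock.
      fresh.localPath = downloadFromDfs(uri);
      fresh.localized.countDown();                  // signal "download complete"
      return fresh.localPath;
    }
    existing.localized.await();                     // everyone else just waits for the event
    return existing.localPath;
  }

  private String downloadFromDfs(String uri) {
    return "/local/cache/" + uri.hashCode();        // stand-in for the costly DFS copy
  }
}
{code}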

One solution Amarsri and I discussed was to see if making the reference count an 
AtomicInteger would help. Then its value could be read without having to acquire a 
lock on the cache status object, and the delete code would look something like this:

{code}
synchronized (cachedArchives) {
  for each lcacheStatus {
    if (lcacheStatus.atomicReferenceCount.get() == 0) {
      synchronized (lcacheStatus) {
        // continue operation as in your patch.
      }
    }
  }
}
{code}

Since we are guaranteed that any code localizing a path will hold the reference 
count at a non-zero value, the delete thread will never proceed to delete an entry 
that is being localized.
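To make that concrete, here is a rough sketch of what the status object could look like with an AtomicInteger count. The names (LCacheStatus, acquire, release, tryDelete) are illustrative only and not part of the actual patch:

{code}
import java.util.concurrent.atomic.AtomicInteger;

class LCacheStatus {
  final AtomicInteger refCount = new AtomicInteger(0);

  // Called while holding the cachedArchives lock, so a delete scan can never
  // miss an in-progress increment.
  void acquire() { refCount.incrementAndGet(); }

  // Called when a task is done with the cached file.
  void release() { refCount.decrementAndGet(); }

  // The cleanup thread reads the count without locking this object; only
  // entries that are provably idle are then locked and deleted.
  boolean tryDelete() {
    if (refCount.get() != 0) {
      return false;                 // still in use (or being localized), skip
    }
    synchronized (this) {
      // delete the localized files, as in the patch
      return true;
    }
  }
}
{code}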

Could this work?

> Incorrect synchronization in DistributedCache causes TaskTrackers to freeze 
> up during localization of Cache for tasks.
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1098
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1098
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>            Reporter: Sreekanth Ramakrishnan
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.21.0
>
>         Attachments: MAPREDUCE-1098.patch, patch-1098-0.20.txt, 
> patch-1098-1.txt, patch-1098-2.txt, patch-1098-ydist.txt, patch-1098.txt
>
>
> Currently {{org.apache.hadoop.filecache.DistributedCache.getLocalCache(URI, 
> Configuration, Path, FileStatus, boolean, long, Path, boolean)}} allows only 
> one {{TaskRunner}} thread in TT to localize {{DistributedCache}} across jobs. 
> The current way of synchronization is across baseDirs; this has to be changed to 
> lock on the same baseDir.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
