[ https://issues.apache.org/jira/browse/MAPREDUCE-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768731#action_12768731 ]
Hemanth Yamijala commented on MAPREDUCE-1098:
---------------------------------------------
Arun, this may not work either.
Basically, the localization code looks like this:
{code}
synchronized (cachedArchives) {
  get lcacheStatus
  synchronized (lcacheStatus) {
    increment reference count
  }
}
synchronized (lcacheStatus) {
  localize cache
}
{code}
The cache deletion code looks like this:
{code}
synchronized (cachedArchives) {
  for each lcacheStatus {
    synchronized (lcacheStatus) {
      if (lcacheStatus.refCount == 0) {
        //
      }
    }
  }
}
{code}
The problem arises while iterating to delete: if a localizing thread is
localizing a cache file for a particular cache object, the delete thread will
wait to acquire the lock on that cache object *after* it has already acquired
the global lock. Since the localization could take a long time, every other
thread that needs the global lock will be blocked for that long, which in
effect does not solve the problem we are trying to solve.
Does this make sense?
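To make the blocking chain concrete, here is a minimal sketch in Java. It is not the DistributedCache code: the class name LockConvoySketch is hypothetical, and it uses ReentrantLock instead of synchronized blocks only so that the outcome can be observed with a timed tryLock. The calling thread plays both the localizer (holding the per-entry lock during a long download) and another task that needs the global lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration of the lock ordering described above.
class LockConvoySketch {
    // Returns { global lock obtainable during localization, obtainable afterwards }.
    static boolean[] run() {
        ReentrantLock globalLock = new ReentrantLock(); // stands in for cachedArchives
        ReentrantLock entryLock = new ReentrantLock();  // stands in for one lcacheStatus
        CountDownLatch deleterHoldsGlobal = new CountDownLatch(1);
        try {
            // Localizer: holds the per-entry lock for the whole "download".
            entryLock.lock();

            Thread deleter = new Thread(() -> {
                globalLock.lock();
                try {
                    deleterHoldsGlobal.countDown();
                    entryLock.lock(); // blocks here until the download finishes
                    try {
                        // the reference count check would happen here
                    } finally {
                        entryLock.unlock();
                    }
                } finally {
                    globalLock.unlock();
                }
            });
            deleter.start();
            deleterHoldsGlobal.await();

            // Another task now needs the global lock; the deleter is sitting on it.
            boolean duringDownload = globalLock.tryLock(100, TimeUnit.MILLISECONDS);
            if (duringDownload) {
                globalLock.unlock();
            }

            entryLock.unlock(); // the download is done
            deleter.join();

            boolean afterDownload = globalLock.tryLock(100, TimeUnit.MILLISECONDS);
            if (afterDownload) {
                globalLock.unlock();
            }
            return new boolean[] { duringDownload, afterDownload };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch the first tryLock fails because the delete thread keeps the global lock while it waits for the per-entry lock, and the second succeeds once the simulated download has released that lock.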
It seems that a fully correct approach should *not* hold a lock on any object
while doing a costly operation like a DFS download. Other threads should
instead wait for a download-complete event notification, or something similar.
But those are sweeping changes.
One solution Amarsri and I discussed was to see if making the reference count
an AtomicInteger would help. Its value could then be read without having to
acquire a lock on the cache status object. The delete code would then be
something like this:
{code}
synchronized (cachedArchives) {
  for each lcacheStatus {
    if (lcacheStatus.atomicReferenceCount.get() == 0) {
      synchronized (lcacheStatus) {
        // continue operation as in your patch.
      }
    }
  }
}
{code}
Since code that is localizing a path is guaranteed to hold a non-zero
reference count, the delete thread will skip that entry: it never proceeds to
the delete operation for it, and so never waits on its lock.
Could this work?
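A minimal Java sketch of this idea follows. It is an assumption-laden illustration, not the patch: CacheStatusSketch, RefCountedCacheSketch, and the "/local/" paths are hypothetical names. It also relies on the reference count being incremented only while the global lock is held, as in the localization code above, so a count of zero seen by the delete thread cannot become non-zero while that thread holds the global lock.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the per-cache status object.
class CacheStatusSketch {
    final AtomicInteger refCount = new AtomicInteger(0);
    final String localPath;

    CacheStatusSketch(String localPath) {
        this.localPath = localPath;
    }
}

class RefCountedCacheSketch {
    // Plays the role of cachedArchives; guarded by its own monitor.
    private final Map<String, CacheStatusSketch> cachedArchives = new LinkedHashMap<>();

    // Localization side: the count is raised only under the global lock.
    CacheStatusSketch acquire(String key) {
        synchronized (cachedArchives) {
            CacheStatusSketch status = cachedArchives.computeIfAbsent(
                    key, k -> new CacheStatusSketch("/local/" + k));
            status.refCount.incrementAndGet();
            return status;
        }
    }

    // Called when a task no longer needs the cache entry.
    void release(CacheStatusSketch status) {
        status.refCount.decrementAndGet();
    }

    // Delete side: read the count without the per-entry lock, and take that
    // lock only for entries that nobody references.
    List<String> deleteUnreferenced() {
        List<String> deleted = new ArrayList<>();
        synchronized (cachedArchives) {
            Iterator<Map.Entry<String, CacheStatusSketch>> it =
                    cachedArchives.entrySet().iterator();
            while (it.hasNext()) {
                CacheStatusSketch status = it.next().getValue();
                if (status.refCount.get() == 0) {
                    synchronized (status) {
                        // Real code would remove the local files here.
                        deleted.add(status.localPath);
                        it.remove();
                    }
                }
            }
        }
        return deleted;
    }
}
```

In this sketch, an entry that is still being localized keeps a non-zero count, so deleteUnreferenced() passes over it without touching its monitor and the global lock is held only briefly.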
> Incorrect synchronization in DistributedCache causes TaskTrackers to freeze
> up during localization of Cache for tasks.
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-1098
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1098
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: tasktracker
> Reporter: Sreekanth Ramakrishnan
> Assignee: Amareshwari Sriramadasu
> Fix For: 0.21.0
>
> Attachments: MAPREDUCE-1098.patch, patch-1098-0.20.txt,
> patch-1098-1.txt, patch-1098-2.txt, patch-1098-ydist.txt, patch-1098.txt
>
>
> Currently {{org.apache.hadoop.filecache.DistributedCache.getLocalCache(URI,
> Configuration, Path, FileStatus, boolean, long, Path, boolean)}} allows only
> one {{TaskRunner}} thread in the TT to localize {{DistributedCache}} across jobs.
> The current synchronization is across baseDir; this has to be changed to
> lock on the same baseDir.