[
https://issues.apache.org/jira/browse/MAPREDUCE-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Arun C Murthy updated MAPREDUCE-1098:
-------------------------------------
Status: Open (was: Patch Available)
This patch has a corner-case synchronization bug: it relies on the
CacheStatus.markForDeletion flag, but getLocalCache and deleteCache could
silently corrupt the distributed cache because they look at *different*
CacheStatus objects, thereby rendering the checks based on
CacheStatus.markForDeletion useless.
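
To illustrate why a per-object flag cannot protect a shared path, here is a
minimal sketch. It is not the DistributedCache code: the CacheStatus class
below is a hypothetical stand-in with only two fields, safeToUse is an invented
helper, and the local path is made up. How the two code paths end up holding
different objects is not modelled; the sketch only shows the consequence.

```java
public class StaleStatusSketch {
    // Hypothetical stand-in for CacheStatus; only the flag and the path matter here.
    static class CacheStatus {
        boolean markForDeletion = false;
        final String localLoadPath;

        CacheStatus(String localLoadPath) {
            this.localLoadPath = localLoadPath;
        }
    }

    // Invented helper: the kind of check a caller might make before using a cache entry.
    static boolean safeToUse(CacheStatus status) {
        return !status.markForDeletion;
    }

    public static void main(String[] args) {
        String sharedLocalPath = "/local/cache/dict.txt"; // made-up path

        // Two distinct objects that both describe the same localized path.
        CacheStatus seenByDelete = new CacheStatus(sharedLocalPath);
        CacheStatus seenByGet = new CacheStatus(sharedLocalPath);

        // The deleting side marks *its* object.
        seenByDelete.markForDeletion = true;

        // The reading side checks *its* object and sees nothing wrong,
        // even though both objects point at the same files on disk.
        System.out.println("reader thinks path is safe: " + safeToUse(seenByGet));
        System.out.println("same local path: "
            + seenByGet.localLoadPath.equals(seenByDelete.localLoadPath));
    }
}
```

The flag check passes on the second object although the path it describes is
about to be removed, which is the situation described above.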
----
The above problem arises because the DistributedCache is currently structured to
share the same underlying local file-system path across all CacheStatus
objects. Effectively there is a 1-1 mapping between files on HDFS and
their localized counterparts.
I think a slightly different solution to the problem exhibited by this
patch is to break the 1-1 mapping between files on HDFS and the localized files
and have each CacheStatus object own a unique localized path. The proposal
is to give each object its own CacheStatus.localLoadPath and to initialize it
by copying the source file from HDFS to that unique localized file. We can then
keep the current (correct) structure for deleteCache and put the smarts
in getLocalCache to do the copy when a CacheStatus is initialized.
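
A rough sketch of the proposal, under stated assumptions: a local file stands
in for the HDFS source, java.nio.file is used instead of the Hadoop FileSystem
API, and the CacheStatus class, its constructor, and the deleteCache helper are
simplified, hypothetical versions rather than the real ones.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class UniqueLocalPathSketch {
    // Hypothetical CacheStatus that owns its own localized copy.
    static class CacheStatus {
        final Path localLoadPath;
        boolean markForDeletion = false;

        // Copy-on-init: every object gets a fresh directory under baseDir
        // and copies the source file into it.
        CacheStatus(Path src, Path baseDir) throws IOException {
            Path uniqueDir = Files.createTempDirectory(baseDir, "cache-");
            this.localLoadPath = uniqueDir.resolve(src.getFileName());
            Files.copy(src, this.localLoadPath);
        }
    }

    // Simplified delete: removes only what this object owns.
    static void deleteCache(CacheStatus status) throws IOException {
        status.markForDeletion = true;
        Files.deleteIfExists(status.localLoadPath);
        Files.deleteIfExists(status.localLoadPath.getParent());
    }

    public static void main(String[] args) throws IOException {
        Path baseDir = Files.createTempDirectory("tt-base");
        // Stand-in for the file on HDFS.
        Path src = Files.write(baseDir.resolve("dict.txt"), "hello".getBytes());

        CacheStatus a = new CacheStatus(src, baseDir);
        CacheStatus b = new CacheStatus(src, baseDir);

        deleteCache(a);

        // Deleting a's copy leaves b's copy untouched.
        System.out.println("a exists: " + Files.exists(a.localLoadPath));
        System.out.println("b exists: " + Files.exists(b.localLoadPath));
    }
}
```

Because no two objects share a path, a delete on one object cannot remove files
that another object is still using. The cost is an extra copy per CacheStatus.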
> Incorrect synchronization in DistributedCache causes TaskTrackers to freeze
> up during localization of Cache for tasks.
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-1098
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1098
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: tasktracker
> Reporter: Sreekanth Ramakrishnan
> Assignee: Amareshwari Sriramadasu
> Fix For: 0.21.0
>
> Attachments: patch-1098-0.20.txt, patch-1098-1.txt, patch-1098-2.txt,
> patch-1098-ydist.txt, patch-1098.txt
>
>
> Currently {{org.apache.hadoop.filecache.DistributedCache.getLocalCache(URI,
> Configuration, Path, FileStatus, boolean, long, Path, boolean)}} allows only
> one {{TaskRunner}} thread in TT to localize {{DistributedCache}} across jobs.
> The current synchronization is across baseDir; this has to be changed to
> lock on the same baseDir.