[
https://issues.apache.org/jira/browse/HADOOP-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167293#comment-15167293
]
Vishwajeet Dusane commented on HADOOP-12666:
--------------------------------------------
[~fabbri] Thanks a lot for your comments.
h6. For FileStatus Cache - I agree that these race conditions exist. My
question about your concern is: could they break any functionality? I think
they would not break any common functionality, based on the variety of Hadoop
applications we have executed with this code.
So let me try to break down the discussion based on the scenarios.
** *What is FileStatus Cache?*
*** The FileStatus cache is a simple process-level cache that mirrors
backend storage FileStatus objects.
*** The time to live of a cached FileStatus object is limited: 5 seconds
by default, configurable through core-site.xml.
*** FileStatus objects are stored in a synchronized LinkedHashMap, where the
key is the fully qualified file path and the value is the FileStatus Java
object along with its time-to-live information.
*** The cache is populated from successful responses to GetFileStatus and
ListStatus calls for existing files/folders. Nonexistent files/folders are not
kept in the cache.
*** The motivation for the cache is to avoid repeated GetFileStatus calls to
the ADL backend and, as a result, gain better performance during job startup
and execution.
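To make the description above concrete, here is a minimal sketch of a cache with those properties. It is an illustration only, not the class from the patch: the name TtlCacheSketch, the generic value type standing in for FileStatus, and the explicit nowMillis parameter (used instead of a system clock so that expiry is easy to reason about) are all assumptions.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the cache described above; not the patch's class. */
final class TtlCacheSketch<V> {
    /** A cached value together with its absolute expiry time. */
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    // Key: fully qualified path. The synchronized wrapper makes each single
    // map call atomic, but not a sequence of calls.
    private final Map<String, Entry<V>> entries =
        Collections.synchronizedMap(new LinkedHashMap<String, Entry<V>>());

    /** Caches a value for ttlSeconds, counted from nowMillis. */
    void put(String qualifiedPath, V value, int ttlSeconds, long nowMillis) {
        entries.put(qualifiedPath, new Entry<>(value, nowMillis + ttlSeconds * 1000L));
    }

    /** Returns the cached value, or null if it is absent or has expired. */
    V get(String qualifiedPath, long nowMillis) {
        Entry<V> e = entries.get(qualifiedPath);
        if (e == null) {
            return null;
        }
        if (nowMillis >= e.expiresAtMillis) {
            // Remove only the entry we inspected, not a newer replacement.
            entries.remove(qualifiedPath, e);
            return null;
        }
        return e.value;
    }

    /** Drops the entry for a path, e.g. after delete/rename/create. */
    void remove(String qualifiedPath) {
        entries.remove(qualifiedPath);
    }
}
```

With a 5 second time to live, an entry put at time 0 is returned at 4999 ms and treated as a miss from 5000 ms onward.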
The scenarios that may occur are listed below.
** *Scenario 1: Concurrent get requests for the same FileStatus object*
*** Multiple threads try to access the same FileStatus object.
Example: GetFileStatus calls for path /a.txt from multiple threads
within a process when the FileStatus instance is present in the cache.
*** This should not be a problem: a valid FileStatus object is returned to
the caller in every thread.
** *Scenario 2: Concurrent put requests for the same FileStatus object*
*** Multiple threads update the same FileStatus object.
{code:java}
public String thread1()
{
// FileStatus fileStatus - For storage filepath /a.txt
...
fileStatusCacheManager.put(fileStatus,5); // Race condition
...
}
...
public String thread2()
{
// FileStatus fileStatus - For storage file /a.txt
...
fileStatusCacheManager.put(fileStatus,5); // Race condition
...
}
{code}
*** Whichever thread wins the race, the metadata in the FileStatus instance
is the same for the same file /a.txt.
*** Hence the last value written for /a.txt is a valid value either way.
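A small sketch of this scenario, under the assumption that both threads carry the same metadata for the path: whichever put lands last, the map ends up with a single entry holding that metadata. The class name ConcurrentPutSketch and the String stand-in for FileStatus are hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

final class ConcurrentPutSketch {
    /** Runs two racing puts for the same key and returns the resulting map. */
    static Map<String, String> raceTwoPuts(String path, String metadata) {
        Map<String, String> cache =
            Collections.synchronizedMap(new LinkedHashMap<String, String>());
        Runnable put = () -> cache.put(path, metadata);
        Thread t1 = new Thread(put);
        Thread t2 = new Thread(put);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return cache;
    }
}
```

The outcome is independent of thread scheduling only because both writers store equal values; if the two threads held different snapshots of the metadata, the later write would win.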
** *Scenario 3: Concurrent get/put requests for the same FileStatus object*
{code:java}
public String thread1()
{
// FileStatus fileStatus - For storage filepath /a.txt
...
fileStatusCacheManager.put(fileStatus,5); // Race condition
...
}
...
public String thread2()
{
Path f = new Path("/a.txt");
...
FileStatus fileStatus =
fileStatusCacheManager.get(makeQualified(f)); // Race condition
...
}
{code}
*** Depending on the order of execution, thread2 may or may not get the
latest value written by thread1. Even synchronized blocks are not going to
guarantee that.
*** In the worst case thread2 gets NULL, i.e. the FileStatus object for
/a.txt does not exist in the cache, so thread2 falls back to invoking the ADL
backend GetFileStatus call.
*** This does not break any functionality either.
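The fallback on a cache miss can be sketched as follows. This is an assumption about the general shape of the lookup, not the patch's code: CacheFallbackSketch, the backend function, and the String stand-in for FileStatus are hypothetical, and the time to live is left out for brevity.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class CacheFallbackSketch {
    private final Map<String, String> cache =
        Collections.synchronizedMap(new LinkedHashMap<String, String>());
    private final Function<String, String> backend; // stand-in for the ADL call
    final AtomicInteger backendCalls = new AtomicInteger();

    CacheFallbackSketch(Function<String, String> backend) {
        this.backend = backend;
    }

    /** Returns the cached status, or asks the backend on a miss. */
    String getFileStatus(String qualifiedPath) {
        String cached = cache.get(qualifiedPath);
        if (cached != null) {
            return cached;
        }
        // Miss (never cached, removed, or lost a race): go to the backend.
        backendCalls.incrementAndGet();
        String fresh = backend.apply(qualifiedPath);
        if (fresh != null) {
            cache.put(qualifiedPath, fresh); // only existing paths are cached
        }
        return fresh;
    }
}
```

A miss therefore costs one extra backend call but never returns a wrong answer, which is why the worst case in this scenario is a performance cost rather than a functional one.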
** *Scenario 4: Concurrent get/remove requests for the same FileStatus object*
{code:java}
public String thread1()
{
Path f = new Path("/a.txt");
...
// Cache cleanup caused by a delete/rename/create operation on /a.txt. Race condition
fileStatusCacheManager.remove(makeQualified(f));
...
}
...
public String thread2()
{
Path f = new Path("/a.txt");
...
FileStatus fileStatus =
fileStatusCacheManager.get(makeQualified(f)); // Race condition
...
}
{code}
*** Depending on the order of execution, thread2 may get stale information
from the cache. As in the scenario above, synchronized blocks are not going to
solve this either.
*** This situation is unavoidable with or without the FileStatus cache and
with or without the ADL storage backend.
** *Scenario 5: Concurrent put/remove requests for the different FileStatus object*
{code:java}
public String thread1()
{
Path f = new Path("/a.txt");
...
// Cache cleanup caused by a delete/rename/create operation on /a.txt. Race condition
fileStatusCacheManager.remove(makeQualified(f));
...
}
...
public String thread2()
{
// FileStatus fileStatus - For storage filepath /a.txt
...
fileStatusCacheManager.put(fileStatus,5); // Race condition
...
}
{code}
*** Depending on the order of execution, the FileStatus cache may hold a
stale instance for 5 seconds. As above, synchronized blocks are not going to
solve this either.
*** This is a corner case and may cause the application to misbehave,
depending on its use case. In such situations the FileStatus cache should be
turned off.
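The problematic ordering can be replayed deterministically on a single thread, which shows why making each cache call atomic does not help: every step below is individually atomic, yet the cache ends up describing a file that no longer exists. The class StalePutSketch, the in-memory backend map, and the String stand-in for FileStatus are hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

final class StalePutSketch {
    /** Replays the interleaving and returns what the cache holds afterwards. */
    static String replay() {
        Map<String, String> backend = new LinkedHashMap<>(); // stand-in for the store
        Map<String, String> cache =
            Collections.synchronizedMap(new LinkedHashMap<String, String>());
        backend.put("/a.txt", "length=10");

        // thread2: fetches the status from the backend while the file exists.
        String fetchedByThread2 = backend.get("/a.txt");

        // thread1: deletes the file and cleans up the cache entry.
        backend.remove("/a.txt");
        cache.remove("/a.txt");

        // thread2: caches the status it fetched before the delete.
        cache.put("/a.txt", fetchedByThread2);

        // The file is gone from the backend, but the cache still reports it
        // until the entry's time to live runs out.
        return cache.get("/a.txt");
    }
}
```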
h6. For volatile usage - Totally agree with you. As I mentioned in the
earlier comment, I will remove the volatile usage for those variables.
> Support Microsoft Azure Data Lake - as a file system in Hadoop
> --------------------------------------------------------------
>
> Key: HADOOP-12666
> URL: https://issues.apache.org/jira/browse/HADOOP-12666
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs, fs/azure, tools
> Reporter: Vishwajeet Dusane
> Assignee: Vishwajeet Dusane
> Attachments: HADOOP-12666-002.patch, HADOOP-12666-003.patch,
> HADOOP-12666-004.patch, HADOOP-12666-005.patch, HADOOP-12666-006.patch,
> HADOOP-12666-1.patch
>
> Original Estimate: 336h
> Time Spent: 336h
> Remaining Estimate: 0h
>
> h2. Description
> This JIRA describes a new file system implementation for accessing Microsoft
> Azure Data Lake Store (ADL) from within Hadoop. This would enable existing
> Hadoop applications such as MR, Hive, HBase, etc. to use ADL store as
> input or output.
>
> ADL is ultra-high capacity, optimized for massive throughput, with rich
> management and security features. More details are available at
> https://azure.microsoft.com/en-us/services/data-lake-store/