[jira] [Commented] (HADOOP-12666) Support Microsoft Azure Data Lake - as a file system in Hadoop

Vishwajeet Dusane (JIRA) Thu, 25 Feb 2016 07:08:10 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167293#comment-15167293
 ]


Vishwajeet Dusane commented on HADOOP-12666:
--------------------------------------------

 [~fabbri] Thanks a lot for your comments.

h6. For FileStatus Cache - I agree that race conditions are possible. My 
question is whether they could break any functionality. Based on the variety 
of Hadoop applications we have run with this code, I think they would not 
break any common functionality.

So let me try to break down the discussion based on the scenarios.
** *What is FileStatus Cache?* 
        *** The FileStatus cache is a simple process-level cache that mirrors 
backend storage FileStatus objects.
        *** The time to live of a cached FileStatus object is limited: 5 seconds 
by default, configurable through core-site.xml.
        *** FileStatus objects are stored in a synchronized LinkedHashMap, where 
the key is the fully qualified file path and the value is the FileStatus Java 
object along with its time-to-live information.
        *** The cache is populated from successful responses to 
GetFileStatus and ListStatus calls for existing files/folders. Nonexistent 
files/folders are not kept in the cache.
        *** The motivation for the cache is to avoid repeated GetFileStatus 
calls to the ADL backend and thereby improve performance at job 
startup and during execution.
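To make the description above concrete, here is a minimal sketch of a cache with that shape. It is not the class from the patch: the name TtlCacheSketch, the injected clock, the use of a String in place of a FileStatus, and the adl://store/a.txt path are all assumptions made so the example runs without Hadoop on the classpath.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a process-level cache with a per-entry time to live. */
class TtlCacheSketch<V> {
    /** A cached value plus the absolute time (in milliseconds) at which it expires. */
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    // Synchronized LinkedHashMap keyed by fully qualified path, as described above.
    private final Map<String, Entry<V>> entries =
        Collections.synchronizedMap(new LinkedHashMap<>());
    // Injected clock (assumption) so expiry can be exercised without sleeping.
    private final LongSupplier clock;

    TtlCacheSketch(LongSupplier clock) {
        this.clock = clock;
    }

    /** Stores the value; it stops being served ttlSeconds from now. */
    void put(String qualifiedPath, V value, int ttlSeconds) {
        long expiresAt = clock.getAsLong() + ttlSeconds * 1000L;
        entries.put(qualifiedPath, new Entry<>(value, expiresAt));
    }

    /** Returns the cached value, or null if it is absent or expired. */
    V get(String qualifiedPath) {
        Entry<V> entry = entries.get(qualifiedPath);
        if (entry == null) {
            return null;
        }
        if (clock.getAsLong() >= entry.expiresAtMillis) {
            // Remove only if the mapping is still this expired entry,
            // so a newer put from another thread is not discarded.
            entries.remove(qualifiedPath, entry);
            return null;
        }
        return entry.value;
    }

    /** Evicts the entry, e.g. after a delete/rename/create on the path. */
    void remove(String qualifiedPath) {
        entries.remove(qualifiedPath);
    }

    public static void main(String[] args) {
        long[] now = {0L}; // fake clock value in milliseconds
        TtlCacheSketch<String> cache = new TtlCacheSketch<>(() -> now[0]);
        cache.put("adl://store/a.txt", "status-of-a.txt", 5);
        System.out.println(cache.get("adl://store/a.txt")); // served from cache
        now[0] = 5_000L; // time to live has elapsed
        System.out.println(cache.get("adl://store/a.txt")); // expired, so null
    }
}
```

Each individual map call is atomic here, but a get followed by a put is not, which is exactly the kind of interleaving the scenarios discuss.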
The scenarios that may occur are as follows.
** *Scenario 1 : Concurrent get request for the same FileStatus object*
        *** Multiple threads try to access the same FileStatus object.
        Example: GetFileStatus calls for path /a.txt from multiple threads 
within a process while the FileStatus instance is present in the cache.
        *** This should not be a problem; a valid FileStatus object is returned 
to the caller in every thread.
** *Scenario 2 : Concurrent put request for the same FileStatus object*
        *** Multiple threads update the same FileStatus object.
        {code:java}
        public String thread1()
        {
            // FileStatus fileStatus - for storage file path /a.txt
            ...
            fileStatusCacheManager.put(fileStatus, 5); // Race condition
            ...
        }
        ...
        public String thread2()
        {
            // FileStatus fileStatus - for storage file path /a.txt
            ...
            fileStatusCacheManager.put(fileStatus, 5); // Race condition
            ...
        }
        {code}
        *** Whichever thread wins the race, the metadata in the FileStatus 
instance is the same for the same file /a.txt.
        *** Hence the most recently written value for /a.txt is valid 
either way.
** *Scenario 3 : Concurrent get/put request for the same FileStatus object*
        {code:java}
        public String thread1()
        {
            // FileStatus fileStatus - for storage file path /a.txt
            ...
            fileStatusCacheManager.put(fileStatus, 5); // Race condition
            ...
        }
        ...
        public String thread2()
        {
            Path f = new Path("/a.txt");
            ...
            FileStatus fileStatus = fileStatusCacheManager.get(makeQualified(f)); // Race condition
            ...
        }
        {code}
        *** Depending on the order of execution, thread2 may or may not get the 
latest value written by thread1. Even synchronized blocks would not 
guarantee that.
        *** In the worst case thread2 gets null, i.e. the FileStatus object for 
/a.txt does not exist in the cache, so thread2 falls back to calling 
GetFileStatus on the ADL backend.
        *** This does not break any functionality either.
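The fallback in Scenario 3 can be sketched roughly as follows. This is only an illustration under stated assumptions: CacheFallbackSketch is a hypothetical class, the backend is modelled as a Function that returns null for a missing path, Strings stand in for FileStatus objects, and the time to live is left out to keep the example short.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch: consult the cache first, fall back to the backend on a miss. */
class CacheFallbackSketch {
    private final Map<String, String> cache =
        Collections.synchronizedMap(new LinkedHashMap<>());
    // Stand-in for the remote GetFileStatus call (assumption, not the ADL client API).
    private final Function<String, String> backend;

    CacheFallbackSketch(Function<String, String> backend) {
        this.backend = backend;
    }

    String getFileStatus(String qualifiedPath) {
        String cached = cache.get(qualifiedPath);
        if (cached != null) {
            return cached; // cache hit: no backend call
        }
        // Cache miss, including a miss caused by losing a race with a put:
        // the caller pays one backend call but still gets a valid answer.
        String fresh = backend.apply(qualifiedPath);
        if (fresh != null) {
            cache.put(qualifiedPath, fresh); // only existing paths are cached
        }
        return fresh;
    }

    public static void main(String[] args) {
        AtomicInteger backendCalls = new AtomicInteger();
        CacheFallbackSketch fs = new CacheFallbackSketch(path -> {
            backendCalls.incrementAndGet();
            return "status-of-" + path;
        });
        fs.getFileStatus("/a.txt"); // miss, goes to the backend
        fs.getFileStatus("/a.txt"); // hit, served from the cache
        System.out.println("backend calls: " + backendCalls.get());
    }
}
```

A lost race therefore costs at most one extra backend call, which is why the scenario is described as not breaking functionality.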
** *Scenario 4: Concurrent get/remove request for the same FileStatus object*
        {code:java}
        public String thread1()
        {
            Path f = new Path("/a.txt");
            ...
            // Cache cleanup caused by a delete/rename/create operation on /a.txt.
            fileStatusCacheManager.remove(makeQualified(f)); // Race condition
            ...
        }
        ...
        public String thread2()
        {
            Path f = new Path("/a.txt");
            ...
            FileStatus fileStatus = fileStatusCacheManager.get(makeQualified(f)); // Race condition
            ...
        }
        {code}
        *** Depending on the order of execution, thread2 may get stale 
information from the cache. As in the scenario above, synchronized blocks 
would not solve this either.
        *** This situation is unavoidable with or without the FileStatus cache 
and with or without the ADL storage backend.
** *Scenario 5: Concurrent put/remove request for the same FileStatus object*
        {code:java}
        public String thread1()
        {
            Path f = new Path("/a.txt");
            ...
            // Cache cleanup caused by a delete/rename/create operation on /a.txt.
            fileStatusCacheManager.remove(makeQualified(f)); // Race condition
            ...
        }
        ...
        public String thread2()
        {
            // FileStatus fileStatus - for storage file path /a.txt
            ...
            fileStatusCacheManager.put(fileStatus, 5); // Race condition
            ...
        }
        {code}
        *** Depending on the order of execution, the FileStatus cache may hold a 
stale instance until its time to live expires (5 seconds by default). As 
above, synchronized blocks would not solve this either.
        *** This is a corner case and may cause the application to misbehave, 
depending on its use case. In such situations the FileStatus cache should 
be turned off.
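The ordering that leaves a stale entry in Scenario 5 can be replayed deterministically on a single thread. The sketch below is only an illustration: StalePutSketch is a hypothetical class, a Set of paths stands in for the ADL backend, and a String stands in for the FileStatus object.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical replay of the Scenario 5 interleaving, step by step on one thread. */
class StalePutSketch {
    /** Returns true if the cache still holds an entry for a path the backend no longer has. */
    static boolean replay() {
        Set<String> backend = new HashSet<>(); // paths that exist in storage
        Map<String, String> cache =
            Collections.synchronizedMap(new LinkedHashMap<>());
        String path = "/a.txt";
        backend.add(path);

        // thread2, step 1: fetches the status from the backend, has not cached it yet.
        String fetched = backend.contains(path) ? "status-of-" + path : null;

        // thread1: deletes the file and cleans up the cache entry.
        backend.remove(path);
        cache.remove(path);

        // thread2, step 2: caches the status it fetched before the delete.
        if (fetched != null) {
            cache.put(path, fetched);
        }

        // The cache now claims /a.txt exists although the backend has deleted it.
        return cache.containsKey(path) && !backend.contains(path);
    }

    public static void main(String[] args) {
        System.out.println("stale entry left in cache: " + replay());
    }
}
```

Every map call here is individually synchronized, yet the stale entry still appears, because the backend read and the cache put in thread2 are separate steps. The entry stays until its time to live runs out.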

h6. For volatile usage - I totally agree with you. As I mentioned in the 
earlier comment, I will remove the volatile usage for those variables.

> Support Microsoft Azure Data Lake - as a file system in Hadoop
> --------------------------------------------------------------
>
>                 Key: HADOOP-12666
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12666
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, tools
>            Reporter: Vishwajeet Dusane
>            Assignee: Vishwajeet Dusane
>         Attachments: HADOOP-12666-002.patch, HADOOP-12666-003.patch, 
> HADOOP-12666-004.patch, HADOOP-12666-005.patch, HADOOP-12666-006.patch, 
> HADOOP-12666-1.patch
>
>   Original Estimate: 336h
>          Time Spent: 336h
>  Remaining Estimate: 0h
>
> h2. Description
> This JIRA describes a new file system implementation for accessing Microsoft 
> Azure Data Lake Store (ADL) from within Hadoop. This would enable existing 
> Hadoop applications such as MR, Hive, HBase, etc. to use the ADL store as 
> input or output.
>  
> ADL is ultra-high capacity, optimized for massive throughput, with rich 
> management and security features. More details are available at 
> https://azure.microsoft.com/en-us/services/data-lake-store/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
