Kai Zheng commented on HADOOP-12876:

Caching FileStatus objects for later use is a very interesting idea!

Good thoughts, [~ste...@apache.org]! 
bq. Would this be per FS instance, or across all instances? ...
I would suggest we start with the easy case and make the cache per instance. 
Sharing the cache across multiple instances may cause more trouble than it is 
worth: besides the security concern you raised, it would be hard to enforce 
cache operations or keep the instances in sync.

bq. Making it generic to filesystems would allow use elsewhere (Swift, S3a), as 
well as isolated testing
Well, this is a really good idea! The optimization looks promising for all 
file systems backed by remote stores, and even for HDFS, I guess. Should we 
have a base issue to lay out the foundation first, not specific to any cloud? 
Once that is done, this issue should be simple, since most of the work will 
probably already be in the common base. Maybe an abstract class like 
{{CloudFileSystem extends FileSystem}} for all cloud-backed file systems to 
inherit from? This way we can share much of the code among all the specific

bq. would there be cache invalidation after file/directory operations?
Right. We would need to consider how and when cached FileStatus objects are 
invalidated or cleaned up. Consider a file that is listed but deleted shortly 
afterwards ...
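
To make the invalidation question concrete, here is a rough sketch of a 
per-instance cache with a time limit plus explicit invalidation on delete or 
rename. Everything in it is an assumption for illustration: {{StatusCache}}, 
{{Status}} (a simplified stand-in for FileStatus) and {{invalidate}} are 
hypothetical names, not existing Hadoop APIs, and paths are plain strings.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-instance status cache; a sketch, not a Hadoop API. */
final class StatusCache {
    /** Simplified stand-in for org.apache.hadoop.fs.FileStatus. */
    static final class Status {
        final String path;
        final long length;
        final boolean isDirectory;

        Status(String path, long length, boolean isDirectory) {
            this.path = path;
            this.length = length;
            this.isDirectory = isDirectory;
        }
    }

    /** A cached status together with the time at which it stops being valid. */
    private static final class Entry {
        final Status status;
        final long expiresAtMillis;

        Entry(Status status, long expiresAtMillis) {
            this.status = status;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;

    StatusCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Caches a status; the caller passes the clock so tests stay deterministic. */
    void put(Status status, long nowMillis) {
        entries.put(status.path, new Entry(status, nowMillis + ttlMillis));
    }

    /** Returns the cached status, or null if it is absent or has expired. */
    Status get(String path, long nowMillis) {
        Entry e = entries.get(path);
        if (e == null) {
            return null;
        }
        if (nowMillis >= e.expiresAtMillis) {
            entries.remove(path, e); // drop the stale entry lazily
            return null;
        }
        return e.status;
    }

    /** Drops the path and everything beneath it, e.g. after delete or rename. */
    void invalidate(String path) {
        String prefix = path.endsWith("/") ? path : path + "/";
        entries.keySet().removeIf(k -> k.equals(path) || k.startsWith(prefix));
    }
}
```

With something like this, delete and rename on the owning FS instance would 
call {{invalidate}} on the affected path, and the time limit would bound how 
long a change made by another process can stay unnoticed.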

Does this need help? If so, it would be good to break this down so that my 
side can take some of the tasks. Thanks.

> [Azure Data Lake] Support for process level FileStatus cache to optimize 
> GetFileStatus frequent operations
> ---------------------------------------------------------------------------------------------------------
>                 Key: HADOOP-12876
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12876
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs, fs/azure, tools
>            Reporter: Vishwajeet Dusane
>            Assignee: Vishwajeet Dusane
> Add support for caching GetFileStatus and ListStatus responses locally for a 
> limited period of time. A short-lived local cache would reduce the number of 
> GetFileStatus calls.
> One example where such a cache would be useful: terasort calls ListStatus on 
> the input directory and follows it with a GetFileStatus call on each file 
> within the directory. For 2048 input files in a directory, this would save 
> 2048 GetFileStatus calls during startup (using the ListStatus response to 
> cache FileStatus instances).
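
The terasort case in the description can be sketched roughly as below. This is 
only an illustration under stated assumptions: {{CountingStore}} is a 
hypothetical in-memory stand-in for the remote store that counts round trips, 
{{ListingPrimedClient}} is a hypothetical wrapper, a status is reduced to a 
file length, and expiry is left out to keep the sketch short.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical client that primes a local cache from directory listings. */
final class ListingPrimedClient {
    /** Hypothetical stand-in for the remote store; counts each round trip. */
    static final class CountingStore {
        int getFileStatusCalls = 0;
        int listStatusCalls = 0;
        private final Map<String, Long> files = new HashMap<>(); // path -> length

        void addFile(String path, long length) {
            files.put(path, length);
        }

        /** One remote call per file. */
        Long getFileStatus(String path) {
            getFileStatusCalls++;
            return files.get(path);
        }

        /** One remote call that returns every file under the directory. */
        Map<String, Long> listStatus(String dir) {
            listStatusCalls++;
            String prefix = dir.endsWith("/") ? dir : dir + "/";
            Map<String, Long> out = new HashMap<>();
            for (Map.Entry<String, Long> e : files.entrySet()) {
                if (e.getKey().startsWith(prefix)) {
                    out.put(e.getKey(), e.getValue());
                }
            }
            return out;
        }
    }

    private final CountingStore store;
    private final Map<String, Long> cache = new HashMap<>();

    ListingPrimedClient(CountingStore store) {
        this.store = store;
    }

    /** Lists the directory remotely and remembers every returned status. */
    Map<String, Long> listStatus(String dir) {
        Map<String, Long> listing = store.listStatus(dir);
        cache.putAll(listing);
        return listing;
    }

    /** Serves from the cache when possible; falls back to the remote store. */
    Long getFileStatus(String path) {
        Long cached = cache.get(path);
        if (cached != null) {
            return cached;
        }
        Long fetched = store.getFileStatus(path);
        if (fetched != null) {
            cache.put(path, fetched);
        }
        return fetched;
    }
}
```

In this sketch, listing a directory of 2048 files and then asking for the 
status of each one costs a single ListStatus round trip and no GetFileStatus 
round trips, which is the saving the description refers to.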
