[ 
https://issues.apache.org/jira/browse/HDFS-9763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15168012#comment-15168012
 ] 

Xiaobing Zhou commented on HDFS-9763:
-------------------------------------

Thanks [~cmccabe].
{quote}
It can be implemented reasonably well outside the filesystem.
{quote}
This is true, but the original requirement here asks for a faster operation. 
It is not easy to meet that goal with only the current set of APIs, such as 
rename. 

{quote}
Implementing an O(N) operation inside HDFS will cause high latencies on the 
NameNode, or tricky code that needs to periodically drop the lock while 
performing a single operation.
{quote}
This is a valid concern. We can cap the runtime of a single merge call by 
limiting the number of files moved per call; multiple such calls can then be 
made to move the contents of a large directory. Iterative merge RPCs trade 
off between overloading the NN with many rename RPC calls and the high 
latency of a single bulk rename. Moreover, this leaves the admin free to tune 
the granularity of merge RPCs to gain performance benefits.

In summary, the merge API is proposed to move the files under one directory to 
another, with the goals of
1). Abstracting the policy for resolving name collisions, for the sake of 
reusability and customizability.
2). Introducing a threshold T to throttle how many files under source can be 
moved to dest per merge RPC. Ceiling(N/T) iterations of merge RPCs will 
eventually move all files under source to dest, where N denotes the number of 
files under source.
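The two goals above can be sketched as follows, using in-memory maps as a 
stand-in for HDFS directories. The names mergeOnce, CollisionPolicy, and the 
suffix-based policy are illustrative assumptions for this discussion, not the 
actual proposed HDFS API:

```java
import java.util.*;

// Sketch of the proposed iterative, throttled merge (assumed names; maps
// stand in for directories so the sketch is self-contained and runnable).
public class MergeSketch {

    // Goal 1): collision resolution is a pluggable policy, not hard-wired.
    interface CollisionPolicy {
        String resolve(String name, Set<String> destNames);
    }

    // Goal 2): one merge RPC moves at most t files from src to dest.
    // Returns the number of files moved (0 means src is empty).
    static int mergeOnce(Map<String, String> src, Map<String, String> dest,
                         int t, CollisionPolicy policy) {
        int moved = 0;
        Iterator<Map.Entry<String, String>> it = src.entrySet().iterator();
        while (it.hasNext() && moved < t) {
            Map.Entry<String, String> e = it.next();
            String name = e.getKey();
            if (dest.containsKey(name)) {
                name = policy.resolve(name, dest.keySet());
            }
            dest.put(name, e.getValue());
            it.remove();
            moved++;
        }
        return moved;
    }

    public static void main(String[] args) {
        Map<String, String> src = new TreeMap<>();
        Map<String, String> dest = new TreeMap<>();
        for (int i = 0; i < 10; i++) src.put("part-" + i, "data" + i);
        dest.put("part-3", "old");

        // Example policy: append a numeric suffix until the name is free.
        CollisionPolicy suffix = (name, names) -> {
            int n = 1;
            while (names.contains(name + "_copy" + n)) n++;
            return name + "_copy" + n;
        };

        int t = 4;    // threshold T: max files moved per merge RPC
        int calls = 0;
        while (mergeOnce(src, dest, t, suffix) > 0) calls++;
        // N=10 files, T=4 -> ceiling(10/4) = 3 merge calls
        System.out.println("calls=" + calls + " destSize=" + dest.size());
    }
}
```

The client-side loop above is the iterative pattern described earlier: the NN 
holds its lock only for one bounded mergeOnce call at a time.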

{quote}
Other filesystems and storage systems for Hadoop will have trouble implementing 
merge, or may not be able to implement it at all (like s3, etc.) since they 
don't have the ability to atomically move a bunch of entries.
{quote}
Atomicity is not a goal. It's better not to address the atomicity of merge, 
since that would raise a general discussion about transaction support in 
storage systems, which is uncommon in state-of-the-art systems, although it 
has appeared in research file systems like Coda.

{quote}
It seems like the TOCTOU in Hive can be avoided by using the rename variant 
that doesn't perform overwrites. If we get an exception about the target path 
already existing, we simply choose another name and try again. This also cuts 
the usual number of operations to be one per file, rather than 2 per file.
{quote}
We would still have the problem of overloading the NN with many rename RPC 
calls in a row.
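For reference, the retry scheme described in the quote looks roughly like the 
following in-memory sketch (renameNoOverwrite is a stand-in for the 
non-overwriting rename variant; the RPC counter is there only to illustrate 
the NN-load concern). Even in the mostly collision-free case it still issues 
at least one RPC per file:

```java
import java.util.*;

// Sketch of the rename-without-overwrite retry loop from the quote, with
// maps standing in for directories and a counter approximating NN load.
public class RenameRetrySketch {
    static int rpcCount = 0;

    // Simulated non-overwriting rename: fails if the target name exists.
    static boolean renameNoOverwrite(Map<String, String> src, String from,
                                     Map<String, String> dest, String to) {
        rpcCount++;    // each attempt is one NN RPC
        if (dest.containsKey(to)) return false;
        dest.put(to, src.remove(from));
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> src = new TreeMap<>();
        Map<String, String> dest = new TreeMap<>();
        for (int i = 0; i < 5; i++) src.put("f" + i, "d" + i);
        dest.put("f2", "old");    // one pre-existing collision

        for (String name : new ArrayList<>(src.keySet())) {
            String target = name;
            int n = 1;
            // On failure, choose another name and try again; this avoids
            // the TOCTOU window of a separate existence check.
            while (!renameNoOverwrite(src, name, dest, target)) {
                target = name + "_" + n++;
            }
        }
        // 5 files, 1 collision -> 6 rename RPCs: one per file plus retries
        System.out.println("rpcs=" + rpcCount);
    }
}
```

So the variant halves the RPC count relative to check-then-rename, but the 
total remains proportional to the number of files, which is what the batched 
merge RPC is meant to reduce.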

I will post a detailed proposal here.

> Add merge api
> -------------
>
>                 Key: HDFS-9763
>                 URL: https://issues.apache.org/jira/browse/HDFS-9763
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Ashutosh Chauhan
>            Assignee: Xiaobing Zhou
>
> It will be good to add a merge(Path dir1, Path dir2, ... ) api to HDFS. 
> Semantics will be to move all files under dir1 to dir2, renaming files in 
> case of collisions.
> In the absence of this api, Hive[1] has to check for a collision for each 
> file, come up with a unique name, try again, and so on. This is inefficient 
> in multiple ways:
> 1) It generates a huge number of calls on the NN (at least 2*number of 
> source files in dir1)
> 2) It suffers from a TOCTOU[2] bug for the client-picked name in case of 
> collision.
> 3) Whole operation is not atomic.
> A merge api outlined as above will be immensely useful for Hive and 
> potentially to other HDFS users.
> [1] 
> https://github.com/apache/hive/blob/release-2.0.0-rc1/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L2576
> [2] https://en.wikipedia.org/wiki/Time_of_check_to_time_of_use



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
