[
https://issues.apache.org/jira/browse/HDFS-9763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15168012#comment-15168012
]
Xiaobing Zhou commented on HDFS-9763:
-------------------------------------
Thanks [~cmccabe].
{quote}
It can be implemented reasonably well outside the filesystem.
{quote}
This is true, but the original requirement asks for a faster operation. That
goal is hard to meet using only the current set of APIs, such as rename.
{quote}
Implementing an O(N) operation inside HDFS will cause high latencies on the
NameNode, or tricky code that needs to periodically drop the lock while
performing a single operation.
{quote}
This is a valid concern. We can cap the runtime of a single merge call by
limiting the number of files moved per call; multiple such calls can then move
the contents of a large directory. Iterative merge RPCs strike a balance
between overloading the NN with many rename RPCs and the high latency of a
single bulk rename. Moreover, they leave the admin free to tune the
granularity of merge RPCs for performance.
In summary, the merge API is proposed to move files from one directory to
another, with the goals of:
1) Abstracting the policy for resolving name collisions, for the sake of
reusability and customizability.
2) Introducing a threshold T to throttle how many files under source can be
moved to dest per merge RPC. Ceiling(N/T) iterations of merge RPCs will
eventually move all files under source to dest, where N denotes the number of
files under source.
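To make the threshold idea concrete, here is a rough Java sketch of the
iteration math and the collision-policy abstraction. `CollisionPolicy` and
`MergeDriver` are hypothetical names for illustration only, not part of any
existing HDFS API or the final HDFS-9763 design:

```java
// Hypothetical sketch of the two proposal goals (illustrative names only).
interface CollisionPolicy {
    // Given a colliding target name and the retry attempt count,
    // return a new candidate name for the moved file.
    String resolve(String name, int attempt);
}

class MergeDriver {
    // Goal 2: with N files under source and a per-RPC threshold T,
    // ceiling(N / T) merge RPCs move everything to dest.
    static int iterations(int n, int t) {
        return (n + t - 1) / t;  // integer ceiling division
    }

    // Goal 1: a simple pluggable collision policy, e.g. suffixing.
    static final CollisionPolicy SUFFIX =
        (name, attempt) -> name + "_copy_" + attempt;
}
```

For example, moving N = 10 files with a threshold T = 3 takes 4 merge RPCs,
so an admin can trade per-call NameNode lock hold time against total RPC count
by tuning T.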
{quote}
Other filesystems and storage systems for Hadoop will have trouble implementing
merge, or may not be able to implement it at all (like s3, etc.) since they
don't have the ability to atomically move a bunch of entries.
{quote}
Atomicity is not a goal. It's better not to address the atomicity of merge,
since that would raise a general discussion about transaction support in
storage systems, which is uncommon in state-of-the-art systems, though it has
appeared in research file systems like Coda.
{quote}
It seems like the TOCTOU in Hive can be avoided by using the rename variant
that doesn't perform overwrites. If we get an exception about the target path
already existing, we simply choose another name and try again. This also cuts
the usual number of operations to be one per file, rather than 2 per file.
{quote}
We still have the problem of overloading the NN with many rename RPC calls in
a row.
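To illustrate the RPC-count concern, here is a small simulation of the
no-overwrite rename-and-retry loop, with an in-memory set standing in for the
dest directory and each rename attempt counted as one NameNode RPC. All names
here are illustrative, not Hive's or HDFS's actual code:

```java
import java.util.List;
import java.util.Set;

class RenameRetrySim {
    // Simulates the client-side no-overwrite rename loop: each attempt is
    // one NameNode RPC; a collision forces a retry under a new name.
    static int moveAll(List<String> srcNames, Set<String> destNames) {
        int rpcs = 0;
        for (String name : srcNames) {
            String candidate = name;
            int attempt = 0;
            while (true) {
                rpcs++;                        // one rename RPC per attempt
                if (destNames.add(candidate)) {
                    break;                     // rename succeeded
                }
                // Target exists: pick another name and try again.
                candidate = name + "_copy_" + (++attempt);
            }
        }
        return rpcs;
    }
}
```

Even with no collisions this is one RPC per file, and every collision adds
another, so moving a large directory still means a long burst of rename RPCs
against the NN.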
I will post a detailed proposal here.
> Add merge api
> -------------
>
> Key: HDFS-9763
> URL: https://issues.apache.org/jira/browse/HDFS-9763
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: fs
> Reporter: Ashutosh Chauhan
> Assignee: Xiaobing Zhou
>
> It will be good to add merge(Path dir1, Path dir2, ... ) api to HDFS.
> Semantics will be to move all files under dir1 to dir2 and doing a rename of
> files in case of collisions.
> In the absence of this API, Hive[1] has to check for a collision for each
> file, then come up with a unique name and try again, and so on. This is
> inefficient in multiple ways:
> 1) It generates a huge number of calls on the NN (at least 2x the number of
> source files in dir1)
> 2) It suffers from TOCTOU[2] bug for client picked up name in case of
> collision.
> 3) Whole operation is not atomic.
> A merge api outlined as above will be immensely useful for Hive and
> potentially to other HDFS users.
> [1]
> https://github.com/apache/hive/blob/release-2.0.0-rc1/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L2576
> [2] https://en.wikipedia.org/wiki/Time_of_check_to_time_of_use
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)