[ 
https://issues.apache.org/jira/browse/HDFS-9763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176355#comment-15176355
 ] 

Colin Patrick McCabe commented on HDFS-9763:
--------------------------------------------

-1 for the proposed merge API for the reasons [~wheat9] and I stated earlier.  
It's complicated, Hive-specific, locks us into the current Hive semantics, and 
isn't needed to address the TOCTOU race.

If the goal is to reduce the number of RPCs that Hive makes to the NameNode, 
there are much simpler ways to do that, such as allowing a single RPC to 
contain multiple HDFS requests.  We could have a generic batch API that lets 
the client send multiple requests as one "batch".  Sending a bunch of renames 
in one RPC would be just one use of this API.  It would be useful for 
applications other than Hive, and would allow Hive to change its merge 
semantics over time without modifying the source code of HDFS.
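
To make the batch idea concrete, here is a rough sketch in Java. It is only an 
illustration under stated assumptions: Rename, ToyNameNode and executeBatch 
are hypothetical names, not existing HDFS types or calls, and the "NameNode" 
is just an in-memory set of paths. The point it shows is that several renames 
travel in one RPC while the collision-naming policy stays in the client.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: none of these types exist in HDFS.
final class BatchRenameSketch {

    /** One request inside a batch. */
    static final class Rename {
        final String src;
        final String dst;

        Rename(String src, String dst) {
            this.src = src;
            this.dst = dst;
        }
    }

    /** Toy stand-in for the NameNode: a set of paths plus an RPC counter. */
    static final class ToyNameNode {
        private final Set<String> paths = new TreeSet<>();
        int rpcCount = 0;

        void create(String path) {
            paths.add(path);
        }

        boolean exists(String path) {
            return paths.contains(path);
        }

        /** Applies every request in one simulated RPC; returns one result per request. */
        synchronized List<Boolean> executeBatch(List<Rename> batch) {
            rpcCount++;
            List<Boolean> results = new ArrayList<>();
            for (Rename r : batch) {
                // A request fails, with no side effects, if the source is
                // missing or the destination already exists.
                boolean ok = paths.contains(r.src) && !paths.contains(r.dst);
                if (ok) {
                    paths.remove(r.src);
                    paths.add(r.dst);
                }
                results.add(ok);
            }
            return results;
        }
    }

    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode();
        nn.create("/dir1/a");
        nn.create("/dir1/b");
        nn.create("/dir2/a"); // will collide with /dir1/a

        // First batch: two renames, one simulated RPC.
        List<Boolean> first = nn.executeBatch(Arrays.asList(
                new Rename("/dir1/a", "/dir2/a"),
                new Rename("/dir1/b", "/dir2/b")));
        System.out.println("first batch results: " + first);

        // The client, not the server, decides how to rename on collision.
        List<Boolean> retry = nn.executeBatch(Arrays.asList(
                new Rename("/dir1/a", "/dir2/a_copy_1")));
        System.out.println("retry results: " + retry + ", RPCs: " + nn.rpcCount);
    }
}
```

In this sketch a failed rename is reported per request rather than aborting 
the batch, so the client can retry only the colliding files with whatever 
naming scheme it prefers. Whether a real batch API should be all-or-nothing 
instead is a design choice this example does not settle.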

> Add merge api
> -------------
>
>                 Key: HDFS-9763
>                 URL: https://issues.apache.org/jira/browse/HDFS-9763
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Ashutosh Chauhan
>            Assignee: Xiaobing Zhou
>         Attachments: HDFS_Merge_API_Proposal.pdf
>
>
> It would be good to add a merge(Path dir1, Path dir2, ... ) API to HDFS. 
> The semantics would be to move all files under dir1 to dir2, renaming 
> files in case of collisions.
> In the absence of this API, Hive[1] has to check for a collision for each 
> file, then come up with a unique name, try again, and so on. This is 
> inefficient in multiple ways:
> 1) It generates a huge number of calls to the NN (at least 2 * the number of 
> source files in dir1).
> 2) It suffers from a TOCTOU[2] bug for the client-picked name in case of 
> collision.
> 3) The whole operation is not atomic.
> A merge API as outlined above would be immensely useful for Hive and 
> potentially for other HDFS users.
> [1] 
> https://github.com/apache/hive/blob/release-2.0.0-rc1/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L2576
> [2] https://en.wikipedia.org/wiki/Time_of_check_to_time_of_use



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)