[ 
https://issues.apache.org/jira/browse/HADOOP-18056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17464943#comment-17464943
 ] 

Ayush Saxena commented on HADOOP-18056:
---------------------------------------

In general, DistCp is used in many migration and replication setups. Ideally 
the client shouldn't pass duplicate paths, but if something goes wrong in the 
client logic and the exact same path accidentally gets added twice, I think 
adding this much smartness won't hurt.

For example, in Hive replication the paths for which DistCp has to be 
triggered are persisted in the NotificationLog in the RDBMS. If, due to some 
issue, we get a duplicate entry there, DistCp will fail. Even if we fix the 
cause of the duplicates, the older entries will still persist. We cannot 
simply run an Alter query on the NotificationLog to remove the duplicates and 
retry, for a bunch of reasons.

I feel adding this much basic smartness won't hurt...
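
A minimal sketch of the kind of filtering meant here, not the actual patch. 
The class and method names are hypothetical, and paths are plain strings to 
keep the example self-contained; the real source list presumably holds Hadoop 
Path objects. Only exact matches are dropped, and the first occurrence keeps 
its position:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper for illustration only; not part of DistCp.
class SourcePathDedup {

    // Drops exact duplicate entries. LinkedHashSet keeps the first
    // occurrence of each path in its original order.
    static List<String> filterDuplicates(List<String> srcPaths) {
        return new ArrayList<>(new LinkedHashSet<>(srcPaths));
    }

    public static void main(String[] args) {
        List<String> srcPaths =
            Arrays.asList("/tmp/file1", "/tmp/file2", "/tmp/file1");
        // prints [/tmp/file1, /tmp/file2]
        System.out.println(filterDuplicates(srcPaths));
    }
}
```

Paths that differ only in form (a trailing slash, a scheme prefix) are still 
treated as distinct, which matches the "exact duplicate" scope of the issue.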

> DistCp: Filter duplicates in the source paths
> ---------------------------------------------
>
>                 Key: HADOOP-18056
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18056
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Ayush Saxena
>            Assignee: Ayush Saxena
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Add basic filtering to remove exact duplicate paths supplied for copying.
> If the same srcPath, say /tmp/file1, is passed in the list twice, DistCp 
> fails with DuplicateFileException after building the listing.
> It would be better to do a basic filtering of duplicate paths.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

