[ https://issues.apache.org/jira/browse/MAPREDUCE-6840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zheng Shao updated MAPREDUCE-6840:
----------------------------------
    Attachment: MAPREDUCE-6840.1.patch

> Distcp to support cutoff time
> -----------------------------
>
>                 Key: MAPREDUCE-6840
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6840
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: distcp
>    Affects Versions: 2.6.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Minor
>         Attachments: MAPREDUCE-6840.1.patch
>
>
> To ensure consistency in datasets on HDFS, some projects, such as file
> formats used with Hive, perform HDFS operations in a particular order. For
> example, if a file format uses an index file, a new version of the index
> file is written to HDFS only after all files mentioned by the index have
> been written to HDFS.
> When we run distcp, it is important to preserve that consistency, so that
> we don't break those file formats.
> A typical solution is to create an HDFS snapshot beforehand and distcp only
> the snapshot. That works well if the user has the superuser privilege
> needed to make the directory snapshottable.
> If not, it would be beneficial to have a cutoff time for distcp, so that
> distcp only copies files modified on or before that cutoff time.
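
The cutoff idea described above can be illustrated with a minimal sketch. This is not the code in MAPREDUCE-6840.1.patch and does not use any distcp or HDFS API; it is a stand-in that walks a local directory and keeps only files whose modification time is on or before a cutoff. The function name files_at_or_before_cutoff and its parameters are hypothetical, chosen only for this illustration.

```python
import os
from pathlib import Path
from typing import Iterator


def files_at_or_before_cutoff(root: Path, cutoff_epoch_seconds: float) -> Iterator[Path]:
    """Yield regular files under `root` modified on or before the cutoff.

    Hypothetical helper: a local-filesystem stand-in for the selection
    step that a cutoff-aware copy would apply to its source listing.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = Path(dirpath) / name
            # Keep the file only if its last modification is not after the cutoff.
            if path.stat().st_mtime <= cutoff_epoch_seconds:
                yield path
```

Under the write ordering described in the issue, an index file that passes this filter was written after the data files it mentions, so those data files also have modification times on or before the cutoff, provided they were not modified again later.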