Zheng Shao created MAPREDUCE-6840:
-------------------------------------

             Summary: Distcp to support cutoff time
                 Key: MAPREDUCE-6840
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6840
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: distcp
    Affects Versions: 2.6.0
            Reporter: Zheng Shao
            Assignee: Zheng Shao
            Priority: Minor


To keep datasets on HDFS consistent, some projects, such as file formats used 
with Hive, perform HDFS operations in a particular order.  For example, if a 
file format uses an index file, a new version of the index file is written to 
HDFS only after all files referenced by the index have been written to HDFS.

When we do distcp, it's important to preserve that consistency, so that we 
don't break those file formats.

A typical solution is to create an HDFS snapshot beforehand and distcp only 
the snapshot.  That works well if the user has the superuser privilege needed 
to make the directory snapshottable.

If not, it would be beneficial for distcp to support a cutoff time, so that 
distcp copies only files modified at or before that cutoff time.
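
The intended filtering could look roughly like the sketch below.  It is only 
an illustration, not distcp code: the class CutoffFilter, the Entry type and 
the method selectAtOrBefore are hypothetical names, and the listing is an 
in-memory stand-in for the files distcp would consider.  It also assumes that 
each index version is written as a new file and that files are not modified 
after they are written; under those assumptions, any index file that passes 
the cutoff was written after the data files it references, so those files 
pass the cutoff too.

```java
import java.util.ArrayList;
import java.util.List;

public class CutoffFilter {

    /** Hypothetical stand-in for one entry in a copy listing. */
    static final class Entry {
        final String path;
        final long modificationTime; // milliseconds since the epoch

        Entry(String path, long modificationTime) {
            this.path = path;
            this.modificationTime = modificationTime;
        }
    }

    /** Returns the paths whose modification time is at or before the cutoff. */
    static List<String> selectAtOrBefore(List<Entry> listing, long cutoffMillis) {
        List<String> selected = new ArrayList<>();
        for (Entry entry : listing) {
            if (entry.modificationTime <= cutoffMillis) {
                selected.add(entry.path);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Data files are written first, then the index that references them.
        List<Entry> listing = new ArrayList<>();
        listing.add(new Entry("/t/data-1", 100L));
        listing.add(new Entry("/t/data-2", 200L));
        listing.add(new Entry("/t/index-v1", 300L)); // references data-1, data-2
        listing.add(new Entry("/t/data-3", 400L));
        listing.add(new Entry("/t/index-v2", 500L)); // references data-1..data-3

        // A cutoff between index-v1 and data-3 keeps index-v1 and its files,
        // and drops the partially written next version.
        System.out.println(selectAtOrBefore(listing, 350L));
    }
}
```

With a cutoff of 350, the sketch keeps data-1, data-2 and index-v1 and skips 
data-3 and index-v2, so the copied set contains no index that points at a 
missing file.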




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
