[ https://issues.apache.org/jira/browse/MAPREDUCE-6572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
John Zhuge updated MAPREDUCE-6572:
----------------------------------
    Description:
MAPREDUCE-5899 improves distcp by supporting incremental data copy. That is, if a file has only been appended to since it was last copied, only the new data needs to be copied.

This improvement was made before the HDFS truncate feature (HDFS-3107) was implemented. Now that truncate is supported, if a large file is truncated even slightly, the whole file still needs to be copied, even with the solution of MAPREDUCE-5899.

Creating this jira to improve the situation, possibly by remembering the smallest truncated size, so that there is a chance to append only from that size on.

HDFS tasks
* Add field *minTruncateLength* to *FileWithSnapshotFeature*, defaulting to the file size.
* Whenever a file is truncated, update the field to the smaller of its current value and the new length.
* Pass the field to the HDFS client in the MODIFY entry of *SnapshotDiffReport*.

CopyMapper tasks
* If *minTruncateLength* < *target_file_length*, CopyMapper should perform a *truncate(target_file_length - minTruncateLength)* operation.
* If *minTruncateLength* < *source_file_length*, CopyMapper should perform an *append(source_file_length - minTruncateLength)* operation.

In some cases, CopyMapper may perform a *truncate* followed by an *append*.

  was:
MAPREDUCE-5899 improves distcp by supporting incremental data copy. That is, if a file has only been appended to since it was last copied, only the new data needs to be copied.

This improvement was made before the HDFS truncate feature (HDFS-3107) was implemented. Now that truncate is supported, if a large file is truncated even slightly, the whole file still needs to be copied, even with the solution of MAPREDUCE-5899.

Creating this jira to improve the situation, possibly by remembering the smallest truncated size, so that there is a chance to append only from that size on.

Thanks.

> Improved incremental data copy of distcp for truncated file
> -----------------------------------------------------------
>
>                 Key: MAPREDUCE-6572
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6572
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Yongjun Zhang
>            Assignee: John Zhuge
>
> MAPREDUCE-5899 improves distcp by supporting incremental data copy. That is,
> if a file has only been appended to since it was last copied, only the new
> data needs to be copied.
> This improvement was made before the HDFS truncate feature (HDFS-3107) was
> implemented. Now that truncate is supported, if a large file is truncated
> even slightly, the whole file still needs to be copied, even with the
> solution of MAPREDUCE-5899.
> Creating this jira to improve the situation, possibly by remembering the
> smallest truncated size, so that there is a chance to append only from that
> size on.
> HDFS tasks
> * Add field *minTruncateLength* to *FileWithSnapshotFeature*, defaulting to
> the file size.
> * Whenever a file is truncated, update the field to the smaller of its
> current value and the new length.
> * Pass the field to the HDFS client in the MODIFY entry of
> *SnapshotDiffReport*.
> CopyMapper tasks
> * If *minTruncateLength* < *target_file_length*, CopyMapper should perform a
> *truncate(target_file_length - minTruncateLength)* operation.
> * If *minTruncateLength* < *source_file_length*, CopyMapper should perform an
> *append(source_file_length - minTruncateLength)* operation.
> In some cases, CopyMapper may perform a *truncate* followed by an *append*.
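
The CopyMapper tasks above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposed patch: the class TruncateAwareCopyPlan and its fields are hypothetical names, and the sketch assumes the target still matches the previous snapshot, so its length is never below minTruncateLength. It only computes how many bytes to cut from the target and how many bytes to append from the source; it does not touch HDFS.

```java
/**
 * Hypothetical helper (not part of Hadoop): plans the target-side
 * operations for one file, given the minTruncateLength described in
 * this issue.
 */
final class TruncateAwareCopyPlan {
    /** Bytes to cut from the end of the target; 0 means no truncate. */
    final long truncateBytes;
    /** Source offset (and target length after truncate) where the append starts. */
    final long appendOffset;
    /** Bytes to copy from the source starting at appendOffset; 0 means no append. */
    final long appendBytes;

    private TruncateAwareCopyPlan(long truncateBytes, long appendOffset, long appendBytes) {
        this.truncateBytes = truncateBytes;
        this.appendOffset = appendOffset;
        this.appendBytes = appendBytes;
    }

    static TruncateAwareCopyPlan plan(long minTruncateLength,
                                      long sourceLength,
                                      long targetLength) {
        if (minTruncateLength < 0 || sourceLength < 0 || targetLength < 0) {
            throw new IllegalArgumentException("lengths must be non-negative");
        }
        // Assumption of this sketch: the target was in sync with the previous
        // snapshot, so it cannot be shorter than the smallest truncated size.
        if (targetLength < minTruncateLength) {
            throw new IllegalArgumentException("target is shorter than minTruncateLength");
        }
        // If minTruncateLength < target_file_length: drop the stale tail.
        long truncateBytes = targetLength - minTruncateLength;
        // If minTruncateLength < source_file_length: copy the new tail.
        long appendBytes = Math.max(0L, sourceLength - minTruncateLength);
        return new TruncateAwareCopyPlan(truncateBytes, minTruncateLength, appendBytes);
    }
}
```

For example, with minTruncateLength = 50, a 100-byte target and an 80-byte source, the plan is to cut 50 bytes from the target and then append the 30 source bytes starting at offset 50. Note that FileSystem.truncate(Path, long) takes the new length rather than a byte count, so an actual implementation would presumably truncate the target to minTruncateLength.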