[ https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883889#comment-13883889 ]
Jing Zhao commented on HADOOP-10295:
------------------------------------

Besides the concern about FileChecksum, some other comments on the current patch:

# We may want to change "checksum" to "checksumtype" in the changes to PRESERVE_STATUS and FileAttribute.
# We do not actually need to pass a FileChecksum to RetriableFileCopyCommand. In RetriableFileCopyCommand#doCopy, if we need to preserve the checksum type, we can get the checksum of the source file, take the checksum type from it, and reuse this checksum in compareCheckSums(). In that case we only need to call sourceFS.getFileChecksum once (note that getFileChecksum is very costly).
# We should use "FsPermission.getFileDefault().applyUMask(FsPermission.getUMask(getConf()))" instead of the bare "FsPermission.getFileDefault()" in the following change (see FileSystem#create(Path, boolean, int, short, long, Progressable)):
{code}
-        tmpTargetPath, true, BUFFER_SIZE,
+        tmpTargetPath, FsPermission.getFileDefault(),
+        EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE), BUFFER_SIZE,
{code}
# The newly added unit test does not cover the scenario where source files have different REAL checksum types (CRC32 and CRC32C). In that case, a copy that preserves the checksum type should succeed, and the original checksum types should be preserved in the target FS. We should add unit tests for this.
# There are some unnecessary whitespace and blank line changes.

> Allow distcp to automatically identify the checksum type of source files and use it for the target
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-10295
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10295
>             Project: Hadoop Common
>          Issue Type: Improvement
>   Affects Versions: 2.2.0
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>         Attachments: HADOOP-10295.000.patch, hadoop-10295.patch
>
>
> Currently while doing distcp, users can use "-Ddfs.checksum.type" to specify
> the checksum type in the target FS. This works fine if all the source files
> are using the same checksum type.
> If files in the source cluster have mixed
> types of checksum, users have to either use "-skipcrccheck" or hit a checksum
> mismatch exception. Thus we may need to consider adding a new option to
> distcp so that it can automatically identify the original checksum type of
> each source file and use the same checksum type in the target FS.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
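
The second review point (fetch the source checksum once, then reuse it both to pick the target checksum type and to verify the copy) can be sketched as follows. This is only a self-contained illustration, not the patch itself and not the Hadoop API: ChecksumReuseSketch, FileSum, sourceChecksum, checksumOf and copyPreservingType are hypothetical names, and java.util.zip.CRC32 / CRC32C (Java 9+) stand in for the file system's checksum implementations.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

public class ChecksumReuseSketch {
    /** The two "real" checksum types mentioned in the comment. */
    enum Type { CRC32, CRC32C }

    /** Hypothetical stand-in for a file checksum: its type plus its value. */
    static final class FileSum {
        final Type type;
        final long value;
        FileSum(Type type, long value) { this.type = type; this.value = value; }
    }

    /** Counts calls to the (assumed expensive) source-side checksum lookup. */
    static int sourceChecksumCalls = 0;

    /** Computes a checksum of the given type over the data. */
    static FileSum checksumOf(byte[] data, Type type) {
        Checksum c = (type == Type.CRC32) ? new CRC32() : new CRC32C();
        c.update(data, 0, data.length);
        return new FileSum(type, c.getValue());
    }

    /** Assumption: models the costly source-side lookup; each call is counted. */
    static FileSum sourceChecksum(byte[] data, Type type) {
        sourceChecksumCalls++;
        return checksumOf(data, type);
    }

    /**
     * Copies the data, writing the target with the source's checksum type,
     * and verifies the copy using the single source checksum fetched up front.
     */
    static boolean copyPreservingType(byte[] source, Type sourceType) {
        FileSum sourceSum = sourceChecksum(source, sourceType); // the only source-side call
        byte[] target = source.clone();                         // stands in for the actual copy
        FileSum targetSum = checksumOf(target, sourceSum.type); // target uses the source's type
        return targetSum.type == sourceSum.type && targetSum.value == sourceSum.value;
    }

    public static void main(String[] args) {
        byte[] data = "hello distcp".getBytes(StandardCharsets.UTF_8);
        // Mixed checksum types on the source side, as in the scenario from point 4.
        for (Type t : Type.values()) {
            System.out.println(t + " copy verified: " + copyPreservingType(data, t));
        }
        System.out.println("source checksum lookups: " + sourceChecksumCalls);
    }
}
```

The counter is only there to make the "call it once per file" property visible; a unit test for the mixed CRC32/CRC32C scenario could assert on the same two properties (copy succeeds, type is preserved) against a real FileSystem.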