[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target
[ https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887651#comment-13887651 ]

Hudson commented on HADOOP-10295:
---------------------------------

SUCCESS: Integrated in Hadoop-Yarn-trunk #467 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/467/])
HADOOP-10295. Allow distcp to automatically identify the checksum type of source files and use it for the target. Contributed by Jing Zhao and Laurent Goujon. (jing9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1563019)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileChecksum.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32FileChecksum.java
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptionSwitch.java
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptions.java
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/OptionsParser.java
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/DistCpUtils.java
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestOptionsParser.java
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyMapper.java

Allow distcp to automatically identify the checksum type of source files and use it for the target
--------------------------------------------------------------------------------------------------

Key: HADOOP-10295
URL: https://issues.apache.org/jira/browse/HADOOP-10295
Project: Hadoop Common
Issue Type: Improvement
Components: tools/distcp
Affects Versions: 2.2.0
Reporter: Jing Zhao
Assignee: Jing Zhao
Fix For: 3.0.0, 2.4.0
Attachments: HADOOP-10295.000.patch, HADOOP-10295.002.patch, hadoop-10295.patch

Currently, when running distcp, users can pass -Ddfs.checksum.type to specify the checksum type in the target FS. This works fine if all the source files use the same checksum type. If files in the source cluster have mixed checksum types, users have to either use -skipcrccheck or get a checksum mismatch exception. Thus we may need to consider adding a new option to distcp so that it can automatically identify the original checksum type of each source file and use the same checksum type in the target FS.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
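To illustrate why mixed checksum types in the source cluster lead to mismatches, here is a minimal sketch that uses only the JDK (Java 9+ for CRC32C), not Hadoop itself. The class and method names are made up for the illustration. It computes CRC32 and CRC32C over the same bytes, using the standard "123456789" check string, and shows that the two values differ, so a comparison only makes sense when both sides use the same algorithm.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

public class ChecksumTypeMismatch {
    /** Feeds all of data into the given checksum algorithm and returns its value. */
    static long checksum(Checksum algorithm, byte[] data) {
        algorithm.update(data, 0, data.length);
        return algorithm.getValue();
    }

    public static void main(String[] args) {
        // Identical content on "source" and "target".
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);

        long crc32 = checksum(new CRC32(), data);   // e.g. source written with CRC32
        long crc32c = checksum(new CRC32C(), data); // e.g. target written with CRC32C

        // Same bytes, different algorithm: the values do not match.
        System.out.printf("CRC32=%08x CRC32C=%08x equal=%b%n",
                crc32, crc32c, crc32 == crc32c);
        // prints: CRC32=cbf43926 CRC32C=e3069283 equal=false
    }
}
```

This is only an analogy for the file-level checksums distcp compares; the point is that identical data does not imply identical checksums unless the checksum type is also identical.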
[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target
[ https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887735#comment-13887735 ]

Hudson commented on HADOOP-10295:
---------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1684 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1684/])
HADOOP-10295. Allow distcp to automatically identify the checksum type of source files and use it for the target. Contributed by Jing Zhao and Laurent Goujon. (jing9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1563019)
[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target
[ https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887749#comment-13887749 ]

Hudson commented on HADOOP-10295:
---------------------------------

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1659 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1659/])
HADOOP-10295. Allow distcp to automatically identify the checksum type of source files and use it for the target. Contributed by Jing Zhao and Laurent Goujon. (jing9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1563019)
[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target
[ https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886903#comment-13886903 ]

Jing Zhao commented on HADOOP-10295:
------------------------------------

Thanks for the review, Nicholas and Sangjin!

[~sjlee0], that was originally done implicitly inside the FileSystem#create call (see FileSystem#create(Path, boolean, int, short, long, Progressable)). I just pulled it out to keep the code from getting too long.
[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target
[ https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886894#comment-13886894 ]

Sangjin Lee commented on HADOOP-10295:
--------------------------------------

The patch looks good. Just one question. I see there is now an explicit call to create the permission in copyToTmpFile(). What is the nature of this change? Was the same thing being done implicitly and it is just made explicit, or is there another reason?
[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target
[ https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886975#comment-13886975 ]

Jing Zhao commented on HADOOP-10295:
------------------------------------

I will commit this patch later today if there are no more comments.
[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target
[ https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887280#comment-13887280 ]

Hudson commented on HADOOP-10295:
---------------------------------

SUCCESS: Integrated in Hadoop-trunk-Commit #5077 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5077/])
HADOOP-10295. Allow distcp to automatically identify the checksum type of source files and use it for the target. Contributed by Jing Zhao and Laurent Goujon. (jing9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1563019)
[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target
[ https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883889#comment-13883889 ]

Jing Zhao commented on HADOOP-10295:
------------------------------------

Besides the concern about FileChecksum, some other comments on the current patch:
# We may want to change checksum to checksumtype in the changes to PRESERVE_STATUS and FileAttribute.
# We do not actually need to pass a FileChecksum to RetriableFileCopyCommand. In RetriableFileCopyCommand#doCopy, if we need to preserve the checksum type, we can fetch the source file's checksum there to determine its type and then reuse the same checksum in compareCheckSums(). That way we only need to call sourceFS.getFileChecksum once (note that getFileChecksum is very costly).
# We should use FsPermission.getFileDefault().applyUMask(FsPermission.getUMask(getConf())) in the following change (see FileSystem#create(Path, boolean, int, short, long, Progressable)):
{code}
-tmpTargetPath, true, BUFFER_SIZE,
+tmpTargetPath, FsPermission.getFileDefault(),
+EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE), BUFFER_SIZE,
{code}
# The newly added unit test does not cover the scenario where source files have different REAL checksum types (CRC32 and CRC32C), in which case a copy that preserves the checksum type should succeed and the original checksum types should be preserved in the target FS. We should add unit tests for this.
# There are some unnecessary whitespace and blank-line changes.
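The permission point in item 3 comes down to simple bit arithmetic. The sketch below is plain Java, not the Hadoop FsPermission API; the helper name is hypothetical, and the file default of 0666 and umask of 022 are assumed values chosen for illustration. It shows what applying a umask to a default file mode yields.

```java
public class UmaskSketch {
    /**
     * Clears every permission bit that is set in the umask.
     * This mirrors the usual POSIX rule: effective = mode & ~umask.
     */
    static int applyUmask(int mode, int umask) {
        return mode & ~umask & 0777;
    }

    public static void main(String[] args) {
        int fileDefault = 0666; // assumed default mode for new files
        int umask = 0022;       // assumed umask; a real cluster's value may differ

        int effective = applyUmask(fileDefault, umask);
        System.out.println(Integer.toOctalString(effective)); // prints: 644
    }
}
```

Passing the unmasked default straight to a create call would, under these assumptions, leave the temporary file group- and world-writable until permissions are fixed up later.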
[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target
[ https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884206#comment-13884206 ]

Laurent Goujon commented on HADOOP-10295:
-----------------------------------------

For point 3, I was using {{getFileDefault()}} because that was the previous behavior, and in {{CopyMapper.map(...)}}, once the copy succeeds, a call is made to {{DistCpUtils.preserve(...)}}, which sets the owner, group, replication and permissions. Should it be refactored?
[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target
[ https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884212#comment-13884212 ]

Kihwal Lee commented on HADOOP-10295:
-------------------------------------

Thanks for working on this, Jing. One thing to note is that the block size needs to be identical, in addition to the checksum parameters, in order for the checksums to match. So it might make more sense to introduce an option to preserve the two together.
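A simplified model of the block-size dependence mentioned above, using only the JDK. This is a sketch, not the HDFS implementation: it assumes a file checksum built as an MD5 over per-block MD5s of per-chunk CRC32 values, and the class name, method name and sizes are hypothetical. Under that assumption, the same bytes produce a different file-level checksum as soon as the block size changes.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.zip.CRC32;

public class BlockSizeChecksumSketch {
    /** MD5 over per-block MD5s of per-chunk CRC32s (simplified, assumed layout). */
    static byte[] fileChecksum(byte[] data, int bytesPerCrc, int blockSize) {
        try {
            MessageDigest fileMd5 = MessageDigest.getInstance("MD5");
            for (int blockStart = 0; blockStart < data.length; blockStart += blockSize) {
                int blockEnd = Math.min(blockStart + blockSize, data.length);
                MessageDigest blockMd5 = MessageDigest.getInstance("MD5");
                for (int chunkStart = blockStart; chunkStart < blockEnd; chunkStart += bytesPerCrc) {
                    int chunkLen = Math.min(bytesPerCrc, blockEnd - chunkStart);
                    CRC32 crc = new CRC32();
                    crc.update(data, chunkStart, chunkLen);
                    // Append the 4-byte CRC of this chunk to the block digest.
                    blockMd5.update(ByteBuffer.allocate(4).putInt((int) crc.getValue()).array());
                }
                // The file digest covers one MD5 per block, so block boundaries matter.
                fileMd5.update(blockMd5.digest());
            }
            return fileMd5.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is required on every Java platform", e);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[4096];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        byte[] source = fileChecksum(data, 512, 1024);
        byte[] sameBlockSize = fileChecksum(data, 512, 1024);
        byte[] otherBlockSize = fileChecksum(data, 512, 2048);

        System.out.println("same block size matches:  " + Arrays.equals(source, sameBlockSize));
        System.out.println("other block size matches: " + Arrays.equals(source, otherBlockSize));
    }
}
```

In this model, preserving only the checksum type is not enough for the post-copy comparison to pass, which is the argument for preserving block size together with it.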
[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target
[ https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884301#comment-13884301 ]

Sangjin Lee commented on HADOOP-10295:
--------------------------------------

Agreed, the option needs to mean that the checksum algorithm *and* the block size are preserved.
[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target
[ https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884984#comment-13884984 ]

Hadoop QA commented on HADOOP-10295:
------------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12625778/HADOOP-10295.002.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-tools/hadoop-distcp.

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/3498//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/3498//console

This message is automatically generated.
[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target
[ https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883830#comment-13883830 ]

Laurent Goujon commented on HADOOP-10295:
-----------------------------------------

Funny, I have been preparing a patch for this very same issue for a week. Some comments regarding your patch:
* Instead of a new command-line option, it may be better to extend the FileAttribute enum.
* MD5MD5CRC32GzipFileChecksum and MD5MD5CRC32CastagnoliFileChecksum are probably HDFS specific (although they are available in hadoop-common). I opened HADOOP-10297 to add {{FileChecksum.getChecksumOpt()}}.
* Instead of doing two instanceof checks, it is possible to use the superclass MD5MD5CRC32FileChecksum.
* EnumSet.of(CreateFlag.OVERWRITE) is not equivalent to setting the overwrite argument to true. In DistributedFileSystem, that is EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE).
* A test that checks whether the option actually works would be nice to have (in my opinion).

Since I also have a patch, I'll attach it to this ticket too, and let a Hadoop maintainer help us sort them out :)
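The CreateFlag point above can be seen with a small self-contained sketch. The enum here is a hypothetical local stand-in that only borrows the names of Hadoop's CreateFlag constants, and the mapping for overwrite=false is an assumption added for completeness; the overwrite=true mapping is the one quoted from DistributedFileSystem in the comment.

```java
import java.util.EnumSet;

public class CreateFlagSketch {
    /** Hypothetical stand-in; only the constant names mirror the real enum. */
    enum CreateFlag { CREATE, OVERWRITE, APPEND }

    /** Maps a legacy boolean 'overwrite' argument onto an explicit flag set. */
    static EnumSet<CreateFlag> fromOverwrite(boolean overwrite) {
        return overwrite
                ? EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE)
                : EnumSet.of(CreateFlag.CREATE); // assumed mapping for overwrite=false
    }

    public static void main(String[] args) {
        EnumSet<CreateFlag> overwriteOnly = EnumSet.of(CreateFlag.OVERWRITE);
        EnumSet<CreateFlag> intended = fromOverwrite(true);

        // The OVERWRITE-only set is missing CREATE, so it is not the same request.
        System.out.println(overwriteOnly.equals(intended));           // prints: false
        System.out.println(intended.contains(CreateFlag.CREATE));     // prints: true
    }
}
```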
[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target
[ https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883835#comment-13883835 ]

Laurent Goujon commented on HADOOP-10295:
-----------------------------------------

[~jingzhao] It seems you are actually a Hadoop committer, so that's just great. I hope you'll find my patch helpful and that you'll be able to add this feature soon!
[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target
[ https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883852#comment-13883852 ]

Jing Zhao commented on HADOOP-10295:
------------------------------------

Thanks for the comment [~laurentgo]!

bq. EnumSet.of(CreateFlag.OVERWRITE) is not equivalent of setting overwrite argument to true. From DistributedFileSystem, it is EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE)

That's right. I also found this problem in my patch.

bq. MD5MD5CRC32GzipFileChecksum and MD5MD5CRC32CastagnoliFileChecksum are probably HDFS specific

I personally like your idea in HADOOP-10297. It can simplify the logic there. However, FileChecksum is a public API marked as stable, so adding a new abstract method there may cause incompatibility (e.g., other people may have implemented their own FileChecksum). A workaround is to add getChecksumOpt() to FileChecksum as a non-abstract method that returns null.

bq. Having a test to check if the option actually works would be a nice to have

Totally agree. Actually I've added a new unit test in my 001 patch, and it is very similar to yours :)

bq. it may be better to extend FileAttribute enum

I thought about this problem. To me the checksum type is a little different from the other file attributes, since those are all metadata stored in the NN. That is why my first patch just added a new option. But now I think putting the checksum type in the FileAttribute enum is clearer.

Currently I have a 001 patch which fixes the CreateFlag bug and adds a unit test. My original plan was to post it after finishing a system test on my local cluster. But since you've worked on this issue for some time and already have a decent patch, I'd like to review your patch and commit it when it is ready.
[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target
[ https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883865#comment-13883865 ]

Laurent Goujon commented on HADOOP-10295:
-----------------------------------------

{quote}
I personally like your idea in HADOOP-10297. That can simplify the logic there. However, FileChecksum is a public API marked as stable, to add a new abstract method there may cause incompatibility (e.g., other ppl may have implemented their own FileChecksum). A workaround here can be adding getChecksumOpt() to FileChecksum and let it return null.
{quote}
Yes, my patch for HADOOP-10297 breaks source compatibility (but not binary compatibility). That may be okay for the next Hadoop major version, but probably not for a minor version. I am waiting for some guidance here (and it's really easy to change).

{quote}
I thought about this problem. To me checksum type may be a little bit different from other file attributes, since other file attributes are all metadata stored in NN. Thus in my first patch I just add a new option. But now I think to put the checksum type in the FileAttribute enum should be more clear.
{quote}
From the user's point of view, block size, replication and checksum option are the same kind of metadata. Only from the FileSystem API's point of view are they different kinds of metadata, because the information is not stored in the same place.

{quote}
Currently I have a 001 patch which fixes the CreateFlag bug and adds a unit test. My original plan is to post it after I finish system test in my local cluster. But since you've worked on this issue for some time and already have a decent patch, I'd like to review your patch and commit it when it is ready.
{quote}
My patch is mostly ready, I think, but it is blocked by the other tickets I mentioned. Hopefully they will be reviewed quickly.
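The compatibility trade-off discussed above can be sketched without Hadoop. All class names below are hypothetical stand-ins, and the String return type replaces whatever option type the real API would use. The sketch shows the workaround: a new non-abstract method that returns null lets a pre-existing subclass compile unchanged, while callers must handle the null.

```java
public class CompatibleExtensionSketch {
    /** Hypothetical stand-in for a stable, public abstract class. */
    abstract static class ChecksumBase {
        abstract String getAlgorithmName();

        /** Added later; non-abstract, so existing subclasses still compile. */
        String getChecksumOpt() {
            return null;
        }
    }

    /** A pre-existing third-party subclass that knows nothing of the new method. */
    static class LegacyChecksum extends ChecksumBase {
        @Override
        String getAlgorithmName() {
            return "LEGACY";
        }
    }

    /** A subclass updated to report its checksum option. */
    static class Crc32cChecksum extends ChecksumBase {
        @Override
        String getAlgorithmName() {
            return "MD5-of-CRC32C";
        }

        @Override
        String getChecksumOpt() {
            return "CRC32C";
        }
    }

    /** Callers have to tolerate null from implementations that predate the method. */
    static String describe(ChecksumBase checksum) {
        String opt = checksum.getChecksumOpt();
        return opt == null ? "unknown" : opt;
    }

    public static void main(String[] args) {
        System.out.println(describe(new LegacyChecksum())); // prints: unknown
        System.out.println(describe(new Crc32cChecksum())); // prints: CRC32C
    }
}
```

Had getChecksumOpt() been declared abstract instead, LegacyChecksum would no longer compile, which is the source incompatibility mentioned in the comment.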