[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target

2014-01-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887651#comment-13887651
 ] 

Hudson commented on HADOOP-10295:
-

SUCCESS: Integrated in Hadoop-Yarn-trunk #467 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/467/])
HADOOP-10295. Allow distcp to automatically identify the checksum type of 
source files and use it for the target. Contributed by Jing Zhao and Laurent 
Goujon. (jing9: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1563019)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileChecksum.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32FileChecksum.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptionSwitch.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptions.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/OptionsParser.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/DistCpUtils.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestOptionsParser.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyMapper.java


 Allow distcp to automatically identify the checksum type of source files and 
 use it for the target
 --

 Key: HADOOP-10295
 URL: https://issues.apache.org/jira/browse/HADOOP-10295
 Project: Hadoop Common
  Issue Type: Improvement
  Components: tools/distcp
Affects Versions: 2.2.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Fix For: 3.0.0, 2.4.0

 Attachments: HADOOP-10295.000.patch, HADOOP-10295.002.patch, 
 hadoop-10295.patch


 Currently, when running distcp, users can pass -Ddfs.checksum.type to specify 
 the checksum type in the target FS. This works fine if all the source files 
 use the same checksum type. If files in the source cluster have mixed 
 checksum types, users must either use -skipcrccheck or run into a checksum 
 mismatch exception. Thus we may need to consider adding a new option to 
 distcp so that it can automatically identify the original checksum type of 
 each source file and use the same checksum type in the target FS. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target

2014-01-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887735#comment-13887735
 ] 

Hudson commented on HADOOP-10295:
-

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1684 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1684/])
HADOOP-10295. Allow distcp to automatically identify the checksum type of 
source files and use it for the target. Contributed by Jing Zhao and Laurent 
Goujon. (jing9: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1563019)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileChecksum.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32FileChecksum.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptionSwitch.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptions.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/OptionsParser.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/DistCpUtils.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestOptionsParser.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyMapper.java




[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target

2014-01-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887749#comment-13887749
 ] 

Hudson commented on HADOOP-10295:
-

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1659 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1659/])
HADOOP-10295. Allow distcp to automatically identify the checksum type of 
source files and use it for the target. Contributed by Jing Zhao and Laurent 
Goujon. (jing9: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1563019)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileChecksum.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32FileChecksum.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptionSwitch.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptions.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/OptionsParser.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/DistCpUtils.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestOptionsParser.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyMapper.java




[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target

2014-01-30 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886903#comment-13886903
 ] 

Jing Zhao commented on HADOOP-10295:


Thanks for the review, Nicholas and Sangjin!

[~sjlee0], that was originally done implicitly inside the FileSystem#create 
call (see FileSystem#create(Path, boolean, int, short, long, Progressable)). I 
just pulled it out to keep the code from getting too long.



[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target

2014-01-30 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886894#comment-13886894
 ] 

Sangjin Lee commented on HADOOP-10295:
--

The patch looks good.

Just one question. I see now there is an explicit call to create the permission 
in copyToTmpFile(). What is the nature of this change? Was the same thing being 
done implicitly and it is just made explicit, or is there another reason?



[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target

2014-01-30 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886975#comment-13886975
 ] 

Jing Zhao commented on HADOOP-10295:


I will commit this patch later today if there is no more comment.



[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target

2014-01-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887280#comment-13887280
 ] 

Hudson commented on HADOOP-10295:
-

SUCCESS: Integrated in Hadoop-trunk-Commit #5077 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5077/])
HADOOP-10295. Allow distcp to automatically identify the checksum type of 
source files and use it for the target. Contributed by Jing Zhao and Laurent 
Goujon. (jing9: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1563019)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileChecksum.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32FileChecksum.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptionSwitch.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptions.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/OptionsParser.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/DistCpUtils.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestOptionsParser.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyMapper.java




[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target

2014-01-28 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883889#comment-13883889
 ] 

Jing Zhao commented on HADOOP-10295:


Besides the concern about FileChecksum, some other comments on the current patch:
# We may want to rename checksum to checksumtype in the changes to 
PRESERVE_STATUS and FileAttribute.
# We do not actually need to pass a FileChecksum to RetriableFileCopyCommand. 
In RetriableFileCopyCommand#doCopy, if we need to preserve the checksum type, 
we can get the checksum of the source file, take its type, and reuse the same 
checksum in compareCheckSums(). That way we only need to call 
sourceFS.getFileChecksum once (note that getFileChecksum is very costly).
# We should use 
FsPermission.getFileDefault().applyUMask(FsPermission.getUMask(getConf())) in 
the following change (see FileSystem#create(Path, boolean, int, short, long, 
Progressable)):
{code}
-tmpTargetPath, true, BUFFER_SIZE,
+tmpTargetPath, FsPermission.getFileDefault(), 
+EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE), BUFFER_SIZE,
{code}
# The newly added unit test does not cover the scenario where source files have 
different REAL checksum types (CRC32 and CRC32C), in which case a copy that 
preserves the checksum type should succeed and the original checksum types 
should be preserved in the target FS. We should add unit tests for this. 
# There are some unnecessary whitespace and blank-line changes.
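
Point 2 above (fetch the source checksum once and reuse it) can be sketched roughly as follows. This is only an illustration under stated assumptions: the Checksum, Fs and CountingFs types and the copyAndVerify method are hypothetical stand-ins, not the Hadoop FileChecksum or FileSystem classes, and the copy step itself is elided.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-ins only; none of these types are the Hadoop classes. */
final class ChecksumReuseSketch {

  /** Minimal checksum value: an algorithm name plus a digest string. */
  static final class Checksum {
    final String type;
    final String digest;
    Checksum(String type, String digest) {
      this.type = type;
      this.digest = digest;
    }
  }

  /** Minimal file-system view exposing only the costly checksum call. */
  interface Fs {
    Checksum getFileChecksum(String path);
  }

  /** Counts calls so the cost of the strategy is visible. */
  static final class CountingFs implements Fs {
    final AtomicInteger calls = new AtomicInteger();
    private final Checksum value;
    CountingFs(Checksum value) {
      this.value = value;
    }
    @Override
    public Checksum getFileChecksum(String path) {
      calls.incrementAndGet();
      return value;
    }
  }

  /** Fetch the source checksum once; use it for both the type and the comparison. */
  static boolean copyAndVerify(Fs source, Fs target, String src, String dst) {
    Checksum sourceChecksum = source.getFileChecksum(src); // single costly call
    String typeForTarget = sourceChecksum.type;            // would drive the create call
    // ... the copy itself would happen here, writing dst with typeForTarget ...
    Checksum targetChecksum = target.getFileChecksum(dst);
    return typeForTarget.equals(targetChecksum.type)
        && sourceChecksum.digest.equals(targetChecksum.digest);
  }

  public static void main(String[] args) {
    CountingFs source = new CountingFs(new Checksum("CRC32C", "abc123"));
    CountingFs target = new CountingFs(new Checksum("CRC32C", "abc123"));
    boolean ok = copyAndVerify(source, target, "/src/f", "/dst/f");
    System.out.println("verified=" + ok + ", source checksum calls=" + source.calls.get());
  }
}
```

The design choice is simply to keep the first result in a local variable instead of asking the source file system a second time during verification.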



[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target

2014-01-28 Thread Laurent Goujon (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884206#comment-13884206
 ] 

Laurent Goujon commented on HADOOP-10295:
-

For point 3, I was using {{getFileDefault()}} because it was the previous 
behavior, and in {{CopyMapper.map(...)}}, once the copy succeeds, a call is made 
to {{DistCpUtils.preserve(...)}}, which sets the owner, group, replication and 
permissions. Should it be refactored?



[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target

2014-01-28 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884212#comment-13884212
 ] 

Kihwal Lee commented on HADOOP-10295:
-

Thanks for working on this, Jing.  One thing to note is that the block size 
needs to be identical in addition to the checksum parameters in order for the 
checksums to match. So it might make more sense to introduce an option to 
preserve the two together.
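
The dependence on block size can be illustrated with a rough model. The sketch below assumes a composite checksum computed as an MD5 over the CRC32 of each block; that is a deliberate simplification and not the actual HDFS MD5MD5CRC32 computation, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.zip.CRC32;

/** Simplified stand-in for a composite file checksum; NOT the HDFS algorithm. */
final class BlockChecksumSketch {

  /** MD5 over the CRC32 of each blockSize-sized slice of data. */
  static byte[] compositeChecksum(byte[] data, int blockSize) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      for (int off = 0; off < data.length; off += blockSize) {
        int len = Math.min(blockSize, data.length - off);
        CRC32 crc = new CRC32();
        crc.update(data, off, len);
        long v = crc.getValue();
        // Feed the 4 CRC bytes (big-endian) into the outer digest.
        md5.update(new byte[] {
            (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v});
      }
      return md5.digest();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 is required by the Java platform", e);
    }
  }

  public static void main(String[] args) {
    byte[] data = "the same bytes on both clusters".getBytes(StandardCharsets.UTF_8);
    byte[] a = compositeChecksum(data, 8);
    byte[] b = compositeChecksum(data, 8);
    byte[] c = compositeChecksum(data, 16);
    System.out.println("same block size matches:      " + Arrays.equals(a, b));
    System.out.println("different block size matches: " + Arrays.equals(a, c));
  }
}
```

Identical bytes split at different block boundaries feed a different sequence of per-block values into the outer digest, which is why preserving the checksum type alone may not be enough for the comparison to succeed.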



[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target

2014-01-28 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884301#comment-13884301
 ] 

Sangjin Lee commented on HADOOP-10295:
--

Agree the option needs to mean that the checksum algorithm *and* the blocksize 
are preserved.



[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target

2014-01-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884984#comment-13884984
 ] 

Hadoop QA commented on HADOOP-10295:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12625778/HADOOP-10295.002.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common hadoop-tools/hadoop-distcp.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3498//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3498//console

This message is automatically generated.



[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target

2014-01-27 Thread Laurent Goujon (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883830#comment-13883830
 ] 

Laurent Goujon commented on HADOOP-10295:
-

Funny, I have been preparing a patch for this very same issue for a week.

Some comments regarding your patch:
* Instead of a new command-line option, it may be better to extend the 
FileAttribute enum.
* MD5MD5CRC32GzipFileChecksum and MD5MD5CRC32CastagnoliFileChecksum are 
probably HDFS specific (although they are available in hadoop-common). I opened 
HADOOP-10297 to add {{FileChecksum.getChecksumOpt()}}.
* Instead of doing two instanceof checks, it is possible to use the superclass 
MD5MD5CRC32FileChecksum.
* EnumSet.of(CreateFlag.OVERWRITE) is not equivalent to setting the overwrite 
argument to true. In DistributedFileSystem, the equivalent is 
EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE).
* Having a test to check that the option actually works would be nice to have 
(in my opinion).
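
The CreateFlag point can be shown with a small self-contained sketch. The CreateFlag enum below is a local stand-in that only mirrors the names used in this thread, not the Hadoop enum; the flag set for overwrite=true is the one stated above, while the set for overwrite=false is an assumption made for illustration.

```java
import java.util.EnumSet;

/** Illustration only; CreateFlag here is a local stand-in, not the Hadoop enum. */
final class CreateFlagSketch {

  enum CreateFlag { CREATE, OVERWRITE, APPEND }

  /**
   * Flag set matching a boolean overwrite argument. The true case is the
   * one stated in the thread; the false case is an assumption.
   */
  static EnumSet<CreateFlag> flagsFor(boolean overwrite) {
    return overwrite
        ? EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE)
        : EnumSet.of(CreateFlag.CREATE);
  }

  public static void main(String[] args) {
    EnumSet<CreateFlag> overwriteOnly = EnumSet.of(CreateFlag.OVERWRITE);
    EnumSet<CreateFlag> equivalent = flagsFor(true);
    // OVERWRITE alone carries no CREATE flag, so the two sets differ.
    System.out.println(overwriteOnly.contains(CreateFlag.CREATE)); // false
    System.out.println(equivalent.contains(CreateFlag.CREATE));    // true
    System.out.println(overwriteOnly.equals(equivalent));          // false
  }
}
```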

Since I also have a patch, I'll attach it to this ticket too, and let a 
Hadoop maintainer help us sort them out :) 



[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target

2014-01-27 Thread Laurent Goujon (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883835#comment-13883835
 ] 

Laurent Goujon commented on HADOOP-10295:
-

[~jingzhao] It seems you are actually a Hadoop committer, so it's just great. 
Hope you'll find my patch helpful and you'll be able to add this feature soon!



[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target

2014-01-27 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883852#comment-13883852
 ] 

Jing Zhao commented on HADOOP-10295:


Thanks for the comment [~laurentgo]!

bq. EnumSet.of(CreateFlag.OVERWRITE) is not equivalent of setting overwrite 
argument to true. From DistributedFileSystem, it is 
EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE)
That's right. I also found this problem in my patch.

bq. MD5MD5CRC32GzipFileChecksum and MD5MD5CRC32CastagnoliFileChecksum are 
probably HDFS specific
I personally like your idea in HADOOP-10297. That can simplify the logic there. 
However, FileChecksum is a public API marked as stable, so adding a new abstract 
method there may cause incompatibility (e.g., other people may have implemented 
their own FileChecksum). A workaround is to add a non-abstract getChecksumOpt() 
to FileChecksum and have it return null.
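
A minimal sketch of that workaround, assuming nothing about the real classes: FileChecksumLike, ChecksumOpt, LegacyChecksum and Crc32cChecksum below are hypothetical stand-ins, and the fallback handling in targetChecksumType is an illustrative choice rather than anything proposed in the patch.

```java
/** Hypothetical stand-ins; these are not the Hadoop FileChecksum classes. */
final class CompatSketch {

  /** Stand-in for a checksum option value carrying only a type name. */
  static final class ChecksumOpt {
    final String type;
    ChecksumOpt(String type) {
      this.type = type;
    }
  }

  /** Stand-in for a stable public abstract class with existing subclasses. */
  abstract static class FileChecksumLike {
    abstract String getAlgorithmName();

    /** New method with a default body, so subclasses written earlier still compile. */
    ChecksumOpt getChecksumOpt() {
      return null;
    }
  }

  /** A subclass written before getChecksumOpt() existed; it needs no change. */
  static final class LegacyChecksum extends FileChecksumLike {
    @Override
    String getAlgorithmName() {
      return "LEGACY";
    }
  }

  /** A subclass that knows its checksum type and overrides the new method. */
  static final class Crc32cChecksum extends FileChecksumLike {
    @Override
    String getAlgorithmName() {
      return "MD5-of-CRC32C";
    }
    @Override
    ChecksumOpt getChecksumOpt() {
      return new ChecksumOpt("CRC32C");
    }
  }

  /** Callers must handle null, which signals "no option available". */
  static String targetChecksumType(FileChecksumLike checksum, String fallback) {
    ChecksumOpt opt = checksum.getChecksumOpt();
    return opt == null ? fallback : opt.type;
  }

  public static void main(String[] args) {
    System.out.println(targetChecksumType(new LegacyChecksum(), "DEFAULT"));
    System.out.println(targetChecksumType(new Crc32cChecksum(), "DEFAULT"));
  }
}
```

The trade-off is that every caller has to check for null, in exchange for not breaking subclasses that were compiled against the older class.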

bq. Having a test to check if the option actually works would be a nice to have
Totally agree. Actually I've added a new unit test in my 001 patch, and the new 
unit test is very similar to yours :)

bq. it may be better to extend FileAttribute enum
I thought about this problem. To me the checksum type is a little different 
from other file attributes, since the other file attributes are all metadata 
stored in the NN. Thus in my first patch I just added a new option. But now I 
think putting the checksum type in the FileAttribute enum would be clearer.

Currently I have a 001 patch which fixes the CreateFlag bug and adds a unit 
test. My original plan was to post it after finishing system tests on my local 
cluster. But since you've worked on this issue for some time and already have a 
decent patch, I'd like to review your patch and commit it when it is ready. 





[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target

2014-01-27 Thread Laurent Goujon (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883865#comment-13883865
 ] 

Laurent Goujon commented on HADOOP-10295:
-

{quote}
I personally like your idea in HADOOP-10297. That can simplify the logic there. 
However, FileChecksum is a public API marked as stable, so adding a new abstract 
method there may cause incompatibility (e.g., other people may have implemented 
their own FileChecksum). A workaround is to add a non-abstract getChecksumOpt() 
to FileChecksum and have it return null by default.
{quote}
Yes, my patch for HADOOP-10297 breaks source compatibility (but not binary 
compatibility). That may be okay for the next major Hadoop version, but probably 
not for a minor version. I am waiting for some guidance here (and it's really 
easy to change).

{quote}
I thought about this problem. To me, the checksum type is a little different 
from the other file attributes, since those are all metadata stored in the NN. 
That is why my first patch just added a new option. But now I think putting 
the checksum type in the FileAttribute enum would be clearer.
{quote}
From the user's point of view, block size, replication and checksum option are 
all the same kind of metadata. Only at the FileSystem API level does the 
checksum option look different, because the information is not stored in the 
same place.

{quote}
Currently I have a 001 patch which fixes the CreateFlag bug and adds a unit 
test. My original plan was to post it after finishing system tests on my local 
cluster. But since you've worked on this issue for some time and already have a 
decent patch, I'd like to review your patch and commit it when it is ready. 
{quote}
I think my patch is mostly ready, but it is blocked by the other tickets I 
mentioned. Hopefully they will be reviewed quickly.
