[
https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883852#comment-13883852
]
Jing Zhao commented on HADOOP-10295:
------------------------------------
Thanks for the comment [~laurentgo]!
bq. EnumSet.of(CreateFlag.OVERWRITE) is not equivalent to setting the overwrite
argument to true. From DistributedFileSystem, it is
EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE)
That's right. I also found this problem in my patch.
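To make the difference concrete, here is a minimal sketch. It uses a local stand-in enum rather than the real org.apache.hadoop.fs.CreateFlag so that it runs without Hadoop on the classpath; the helper name flagsFor and the flag set used for overwrite=false are assumptions for illustration.

```java
import java.util.EnumSet;

class CreateFlagSketch {
    // Stand-in for org.apache.hadoop.fs.CreateFlag, reduced to the flags discussed here.
    enum CreateFlag { CREATE, OVERWRITE, APPEND }

    // Hypothetical helper: the flag set matching the boolean overwrite argument.
    // The overwrite=true case is the one quoted from DistributedFileSystem above;
    // the overwrite=false case is an assumption.
    static EnumSet<CreateFlag> flagsFor(boolean overwrite) {
        return overwrite
            ? EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE)
            : EnumSet.of(CreateFlag.CREATE);
    }

    public static void main(String[] args) {
        EnumSet<CreateFlag> overwriteOnly = EnumSet.of(CreateFlag.OVERWRITE);
        EnumSet<CreateFlag> createAndOverwrite = flagsFor(true);

        // The two sets are different: OVERWRITE alone does not carry CREATE.
        System.out.println(overwriteOnly.equals(createAndOverwrite));
        System.out.println(overwriteOnly.contains(CreateFlag.CREATE));
        System.out.println(createAndOverwrite.contains(CreateFlag.CREATE));
    }
}
```

With OVERWRITE alone, the CREATE flag is missing from the set, which is the mismatch pointed out in the quoted comment.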
bq. MD5MD5CRC32GzipFileChecksum and MD5MD5CRC32CastagnoliFileChecksum are
probably HDFS specific
I personally like your idea in HADOOP-10297. That can simplify the logic there.
However, FileChecksum is a public API marked as stable, so adding a new abstract
method there may break compatibility (e.g., other people may have implemented
their own FileChecksum). A workaround is to add a non-abstract getChecksumOpt()
to FileChecksum and let it return null by default.
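A rough sketch of that workaround, using simplified stand-in classes rather than the real Hadoop types. The ChecksumOpt fields, the two subclasses, and the checksumTypeOrDefault helper are all hypothetical and only illustrate the compatibility argument.

```java
class ChecksumOptSketch {
    // Hypothetical stand-in for a checksum-options value (type name plus bytes per CRC).
    static final class ChecksumOpt {
        final String type;
        final int bytesPerChecksum;

        ChecksumOpt(String type, int bytesPerChecksum) {
            this.type = type;
            this.bytesPerChecksum = bytesPerChecksum;
        }
    }

    // Stand-in for a stable public abstract class such as FileChecksum.
    static abstract class FileChecksum {
        abstract String getAlgorithmName();

        // Added as a concrete method returning null, so subclasses written
        // before this method existed still compile and run unchanged.
        ChecksumOpt getChecksumOpt() {
            return null;
        }
    }

    // Imagined third-party subclass that predates getChecksumOpt().
    static class LegacyChecksum extends FileChecksum {
        @Override String getAlgorithmName() { return "LEGACY"; }
    }

    // Imagined subclass that knows its checksum type and overrides the default.
    static class Crc32cChecksum extends FileChecksum {
        @Override String getAlgorithmName() { return "MD5-of-MD5-of-CRC32C"; }
        @Override ChecksumOpt getChecksumOpt() { return new ChecksumOpt("CRC32C", 512); }
    }

    // Callers must handle null, which here means "checksum type unknown".
    static String checksumTypeOrDefault(FileChecksum checksum, String defaultType) {
        ChecksumOpt opt = checksum.getChecksumOpt();
        return opt == null ? defaultType : opt.type;
    }

    public static void main(String[] args) {
        System.out.println(checksumTypeOrDefault(new LegacyChecksum(), "CRC32"));
        System.out.println(checksumTypeOrDefault(new Crc32cChecksum(), "CRC32"));
    }
}
```

The trade-off is that every caller has to treat null as "unknown" and fall back to some default, whereas an abstract method would force each implementation to answer but would break existing subclasses.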
bq. Having a test to check if the option actually works would be a nice to have
Totally agree. Actually I've added a new unit test in my 001 patch, and the new
unit test is very similar to yours :)
bq. it may be better to extend FileAttribute enum
I thought about this. To me the checksum type is a little different from other
file attributes, since the other file attributes are all metadata stored in the
NN. That is why my first patch just added a new option. But now I think putting
the checksum type in the FileAttribute enum is clearer.
Currently I have a 001 patch which fixes the CreateFlag bug and adds a unit
test. My original plan was to post it after finishing system tests on my local
cluster. But since you've worked on this issue for some time and already have a
decent patch, I'd like to review your patch and commit it when it is ready.
> Allow distcp to automatically identify the checksum type of source files and
> use it for the target
> --------------------------------------------------------------------------------------------------
>
> Key: HADOOP-10295
> URL: https://issues.apache.org/jira/browse/HADOOP-10295
> Project: Hadoop Common
> Issue Type: Improvement
> Affects Versions: 2.2.0
> Reporter: Jing Zhao
> Assignee: Jing Zhao
> Attachments: HADOOP-10295.000.patch, hadoop-10295.patch
>
>
> Currently while doing distcp, users can use "-Ddfs.checksum.type" to specify
> the checksum type in the target FS. This works fine if all the source files
> are using the same checksum type. If files in the source cluster have mixed
> checksum types, users must either use "-skipcrccheck" or hit a checksum
> mismatch exception. Thus we may need to consider adding a new option to
> distcp so that it can automatically identify the original checksum type of
> each source file and use the same checksum type in the target FS.
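The mismatch described above can be illustrated with a small standalone sketch. It is a simplification: it compares raw CRC32 and CRC32C values from java.util.zip (CRC32C needs Java 9 or later) rather than HDFS's MD5-of-MD5-of-CRC file checksums, and the helper names checksumOf and copyVerifies are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

class ChecksumTypeMismatchSketch {
    // Compute a checksum of the given bytes with the given algorithm.
    static long checksumOf(Checksum algorithm, byte[] data) {
        algorithm.reset();
        algorithm.update(data, 0, data.length);
        return algorithm.getValue();
    }

    // Simplified model of a post-copy check: identical bytes only "verify"
    // when source and target were checksummed with the same algorithm.
    static boolean copyVerifies(byte[] data, Checksum sourceType, Checksum targetType) {
        return checksumOf(sourceType, data) == checksumOf(targetType, data);
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);

        // Same bytes, different checksum types: the comparison fails.
        System.out.println(copyVerifies(data, new CRC32(), new CRC32C()));

        // Target uses the source file's checksum type: the comparison succeeds.
        System.out.println(copyVerifies(data, new CRC32C(), new CRC32C()));
    }
}
```

This is why a single "-Ddfs.checksum.type" value cannot satisfy a source tree with mixed checksum types: the target type would have to be chosen per file.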