[
https://issues.apache.org/jira/browse/HDFS-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16627181#comment-16627181
]
Steve Loughran commented on HDFS-3889:
--------------------------------------
This is a real problem, but it's too late to fix: too many workflows run
distcp without the -skipcrccheck option against stores which don't do
checksums, to the extent that if you add checksums to an FS, people's backups
break (HADOOP-15297).
If this were to be done, it'd have to be through some new checksum option,
something like -checksums with values "skip", "enable", "strict",
"ignore-type-mismatch", "metadata", etc.
The strict one would apply the strictest checks possible; "metadata" would
check the metadata, though there I think it'd be hard pressed to work
reliably.
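To make the idea concrete, here is a rough sketch of how such modes might
decide an outcome. Everything in it is hypothetical: the Mode, Outcome and
Checksum names are invented for illustration and are not part of DistCp, and
the "metadata" mode is left out because it is not clear how it would work.
The sketch assumes "enable" tolerates a missing checksum but treats differing
checksum algorithms as a mismatch, while "strict" refuses to proceed when a
checksum is unavailable.

```java
import java.util.Optional;

public class ChecksumPolicySketch {

    /** Hypothetical modes mirroring the option values floated above. */
    enum Mode { SKIP, ENABLE, STRICT, IGNORE_TYPE_MISMATCH }

    /** Result of comparing a source file against its copy. */
    enum Outcome { MATCH, MISMATCH, UNVERIFIED }

    /** Stand-in for a store's file checksum: algorithm name plus digest. */
    static final class Checksum {
        final String algorithm;
        final String digest;

        Checksum(String algorithm, String digest) {
            this.algorithm = algorithm;
            this.digest = digest;
        }
    }

    /** An empty Optional models a store that cannot supply a checksum. */
    static Outcome compare(Mode mode, Optional<Checksum> source, Optional<Checksum> target) {
        if (mode == Mode.SKIP) {
            return Outcome.UNVERIFIED; // never look at checksums
        }
        if (!source.isPresent() || !target.isPresent()) {
            if (mode == Mode.STRICT) {
                // Assumption: strict fails the copy rather than overwrite blindly.
                throw new IllegalStateException("checksum unavailable in strict mode");
            }
            return Outcome.UNVERIFIED;
        }
        Checksum s = source.get();
        Checksum t = target.get();
        if (!s.algorithm.equals(t.algorithm)) {
            // Digests from different algorithms are not comparable.
            return mode == Mode.IGNORE_TYPE_MISMATCH ? Outcome.UNVERIFIED : Outcome.MISMATCH;
        }
        return s.digest.equals(t.digest) ? Outcome.MATCH : Outcome.MISMATCH;
    }

    public static void main(String[] args) {
        Optional<Checksum> crc = Optional.of(new Checksum("CRC32C", "abcd"));
        Optional<Checksum> md5 = Optional.of(new Checksum("MD5", "abcd"));
        System.out.println(compare(Mode.ENABLE, crc, crc));
        System.out.println(compare(Mode.ENABLE, crc, md5));
        System.out.println(compare(Mode.IGNORE_TYPE_MISMATCH, crc, md5));
        System.out.println(compare(Mode.ENABLE, crc, Optional.<Checksum>empty()));
    }
}
```

The point of returning UNVERIFIED separately from MATCH is that the caller can
log or count unverified copies instead of silently treating them as good.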
> distcp overwrites files even when there are missing checksums
> -------------------------------------------------------------
>
> Key: HDFS-3889
> URL: https://issues.apache.org/jira/browse/HDFS-3889
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: tools
> Affects Versions: 2.0.2-alpha
> Reporter: Colin P. McCabe
> Priority: Minor
>
> If distcp can't read the checksum files for the source and destination
> files-- for any reason-- it ignores the checksums and overwrites the
> destination file. It does produce a log message, but I think the correct
> behavior would be to throw an error and stop the distcp.
> If the user really wants to ignore checksums, he or she can use
> {{-skipcrccheck}} to do so.
> The relevant code is in DistCpUtils#checksumsAreEquals:
> {code}
> try {
>   sourceChecksum = sourceFS.getFileChecksum(source);
>   targetChecksum = targetFS.getFileChecksum(target);
> } catch (IOException e) {
>   LOG.error("Unable to retrieve checksum for " + source + " or " +
>       target, e);
> }
> {code}