[
https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190320#comment-14190320
]
Mithun Radhakrishnan commented on HADOOP-8143:
----------------------------------------------
Ah, actually, thank you for correcting me, Allen. You're right, the
"data-corruption" part is a clumsily worded overstatement, and not
ORC-specific.
1. Checksum verification between source and target is guaranteed to fail for
files that have identical contents but different block-sizes, whenever those
files span more than one block. If HDFS has been working to fix this, do let me
know of the JIRA. The only way to have DistCp succeed in copying such files is
to skip the checksum check, which raises the potential for undetected bad
copies of the file, regardless of format.
2. There's potential for performance degradation when ORC files with large
stripes are copied to clusters with smaller block-sizes, if block-sizes aren't
preserved.
While #2 is of some concern, #1 is of maximum import.
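
To illustrate #1, here is a minimal sketch of why block-size matters to the
comparison. It assumes the file checksum is built the way I understand the
default HDFS one to be (an MD5 over per-block MD5s of per-chunk CRC32s). The
names BYTES_PER_CRC, block_digest and file_checksum are hypothetical and the
sizes are toy values; this is a simulation, not the HDFS implementation.

```python
import hashlib
import zlib

BYTES_PER_CRC = 512  # assumed chunk size for this sketch only


def block_digest(block: bytes) -> bytes:
    """MD5 over the CRC32 of each chunk within a single block."""
    md5 = hashlib.md5()
    for i in range(0, len(block), BYTES_PER_CRC):
        crc = zlib.crc32(block[i:i + BYTES_PER_CRC])
        md5.update(crc.to_bytes(4, "big"))
    return md5.digest()


def file_checksum(data: bytes, block_size: int) -> str:
    """MD5 over the per-block digests, so block boundaries change the result."""
    md5 = hashlib.md5()
    for i in range(0, len(data), block_size):
        md5.update(block_digest(data[i:i + block_size]))
    return md5.hexdigest()


data = bytes(range(256)) * 16  # 4096 bytes of identical content

# The file spans several blocks under both block-sizes.
spans_blocks_a = file_checksum(data, block_size=1024)
spans_blocks_b = file_checksum(data, block_size=2048)

# The file fits inside one block under both block-sizes.
single_block_a = file_checksum(data, block_size=4096)
single_block_b = file_checksum(data, block_size=8192)

print("multi-block checksums match: ", spans_blocks_a == spans_blocks_b)
print("single-block checksums match:", single_block_a == single_block_b)
```

In this sketch the same bytes produce different checksums once the file spans
more than one block, because the per-block digests are taken over different
byte ranges. When the whole file fits in a single block, the block-size no
longer affects the result.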
> Change distcp to have -pb on by default
> ---------------------------------------
>
> Key: HADOOP-8143
> URL: https://issues.apache.org/jira/browse/HADOOP-8143
> Project: Hadoop Common
> Issue Type: Improvement
> Reporter: Dave Thompson
> Assignee: Mithun Radhakrishnan
> Priority: Minor
> Attachments: HADOOP-8143.1.patch
>
>
> We should have the preserve-blocksize option (-pb) on in distcp by default.
> The checksum check, which is on by default, will always fail if the blocksize is not the same.