[ 
https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190320#comment-14190320
 ] 

Mithun Radhakrishnan commented on HADOOP-8143:
----------------------------------------------

Actually, thank you for correcting me, Allen. You're right, the 
"data-corruption" part is a clumsily worded overstatement, and not 
ORC-specific. 

1. Checksum verification between source and target is guaranteed to fail for 
files that have identical contents but different block-sizes (and that span 
multiple blocks). If HDFS has been working to fix this, do let me know of the 
JIRA. The only way to have DistCp succeed in copying such files is to skip 
checksums, which raises the potential for undetected bad copies, regardless 
of file format.

2. There's potential for performance degradation when ORC files with large 
stripes are copied to clusters with smaller block-sizes, if block-sizes aren't 
preserved.

While #2 is of some concern, #1 is of maximum import. 
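
To illustrate #1, here is a minimal sketch, assuming a simplified model of a 
block-composed file checksum: an MD5 over per-block MD5s, each taken over 
per-chunk CRCs. It is not a byte-exact reproduction of what HDFS computes. The 
function name, the 512-byte chunk size, and the tiny block sizes are all 
hypothetical, chosen only to keep the example small.

```python
import hashlib
import zlib


def block_structured_checksum(data: bytes, block_size: int,
                              chunk_size: int = 512) -> str:
    """Simplified model: MD5 of per-block MD5s of per-chunk CRC32 values.

    Hypothetical helper for illustration only; not the HDFS implementation.
    """
    block_digests = []
    for b in range(0, len(data), block_size):
        block = data[b:b + block_size]
        # CRC each fixed-size chunk within this block.
        crcs = b"".join(
            zlib.crc32(block[c:c + chunk_size]).to_bytes(4, "big")
            for c in range(0, len(block), chunk_size)
        )
        # One digest per block, taken over that block's chunk CRCs.
        block_digests.append(hashlib.md5(crcs).digest())
    # The file-level value depends on where the block boundaries fall.
    return hashlib.md5(b"".join(block_digests)).hexdigest()


data = bytes(range(256)) * 64  # 16 KiB of identical "file contents"

# Same contents, same block size: the checksums agree.
same = (block_structured_checksum(data, 4096)
        == block_structured_checksum(data, 4096))

# Same contents, different block sizes, file spans several blocks: they differ.
differs = (block_structured_checksum(data, 4096)
           != block_structured_checksum(data, 8192))

# A file that fits in one block either way is unaffected by the block size.
small = data[:2048]
single_block = (block_structured_checksum(small, 4096)
                == block_structured_checksum(small, 8192))
```

In this model the bytes are identical in every case. Only the block 
boundaries change, which is enough to make the file-level checksum differ 
once the file spans more than one block.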

> Change distcp to have -pb on by default
> ---------------------------------------
>
>                 Key: HADOOP-8143
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8143
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Dave Thompson
>            Assignee: Mithun Radhakrishnan
>            Priority: Minor
>         Attachments: HADOOP-8143.1.patch
>
>
> We should have preserve blocksize (-pb) on in distcp by default.
> The checksum check, which is on by default, will always fail if the
> blocksize is not the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
