[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838988#comment-15838988 ]

Aaron T. Myers commented on HADOOP-11794:
-----------------------------------------

Latest patch looks pretty good to me. Just a few small comments:

# "randomdize" -> "randomize": {{// When splitLargeFile is enabled, we don't 
randomdize the copylist}}
# In two places you have basically "if (LOG.isDebugEnabled()) { LOG.warn(...); }". 
You should use {{LOG.debug(...)}} in these places, and perhaps also make the 
messages a little more helpful than just "add1", which requires someone to read 
the source code to understand what happened (see the first sketch after this list).
# I think this option description is a little misleading:
{code}
+  CHUNK_SIZE("",
+      new Option("chunksize", true, "Size of chunk in number of blocks when " +
+          "splitting large files into chunks to copy in parallel")),
{code}
Assuming I'm reading the code correctly, a file is considered "large" in this 
context simply when it has more blocks than the configured chunk size. The 
description above also seems to imply that there is some separate configuration 
option to enable/disable splitting large files at all. I think better text would 
be something like "If set to a positive value, files with more blocks than this 
value will be split at their block boundaries during transfer, and reassembled 
on the destination cluster. By default, files will be transmitted in their 
entirety without splitting."
# Rather than suppressing the checkstyle warnings, I'd recommend implementing the 
builder pattern for the {{CopyListingFileStatus}} constructors. That should 
make things quite a bit clearer (see the second sketch after this list).
# A handful of the changed lines appear to be whitespace-only changes; not a 
big deal.
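
To make item 2 concrete, here is a minimal before/after sketch of the 
guarded-logging fix. The message text and the names {{chunkIndex}} and 
{{fileStatus}} are hypothetical stand-ins; the actual variables in the patch 
may differ:
{code}
// Before (as in the current patch): a debug guard around a warn-level
// call, with a message that only makes sense next to the source code.
if (LOG.isDebugEnabled()) {
  LOG.warn("add1");
}

// After: the level inside the guard matches the guard itself, and the
// message describes what happened without requiring the reader to open
// the source. chunkIndex and fileStatus are hypothetical names here.
if (LOG.isDebugEnabled()) {
  LOG.debug("Adding chunk " + chunkIndex + " of "
      + fileStatus.getPath() + " to the copy list");
}
{code}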
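
And for item 4, a rough sketch of the builder I have in mind. This is 
illustrative only and trimmed down; the real {{CopyListingFileStatus}} carries 
more state than the three fields shown here:
{code}
import org.apache.hadoop.fs.Path;

public final class CopyListingFileStatus {
  private final Path path;
  private final long chunkOffset;
  private final long chunkLength;

  private CopyListingFileStatus(Builder b) {
    this.path = b.path;
    this.chunkOffset = b.chunkOffset;
    this.chunkLength = b.chunkLength;
  }

  public static class Builder {
    private Path path;
    private long chunkOffset = 0L;
    private long chunkLength = Long.MAX_VALUE;

    public Builder path(Path path) { this.path = path; return this; }
    public Builder chunkOffset(long offset) { this.chunkOffset = offset; return this; }
    public Builder chunkLength(long length) { this.chunkLength = length; return this; }
    public CopyListingFileStatus build() { return new CopyListingFileStatus(this); }
  }

  // Call sites then name every argument instead of passing a long
  // positional list to an overloaded constructor, e.g.:
  //
  //   CopyListingFileStatus status = new CopyListingFileStatus.Builder()
  //       .path(file.getPath())
  //       .chunkOffset(offset)
  //       .chunkLength(len)
  //       .build();
}
{code}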

> distcp can copy blocks in parallel
> ----------------------------------
>
>                 Key: HADOOP-11794
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11794
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 0.21.0
>            Reporter: dhruba borthakur
>            Assignee: Yongjun Zhang
>         Attachments: HADOOP-11794.001.patch, HADOOP-11794.002.patch, 
> MAPREDUCE-2257.patch
>
>
> The minimum unit of work for a distcp task is a file. We have files that are 
> greater than 1 TB with a block size of 1 GB. If we use distcp to copy these 
> files, the tasks either take a long, long time or finally fail. A better 
> approach for distcp would be to copy all the source blocks in parallel, and 
> then stitch the blocks back into files at the destination via the HDFS concat 
> API (HDFS-222).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
