[
https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838988#comment-15838988
]
Aaron T. Myers commented on HADOOP-11794:
-----------------------------------------
Latest patch looks pretty good to me. Just a few small comments:
# "randomdize" -> "randomize": {{// When splitLargeFile is enabled, we don't
randomdize the copylist}}
# In two places you have essentially "if (LOG.isDebugEnabled()) { LOG.warn(...); }".
You should use {{LOG.debug(...)}} in these places, and perhaps also make these
debug messages a little more helpful than just "add1", which requires reading
the source code to understand.
# I think this option description is a little misleading:
{code}
+ CHUNK_SIZE("",
+ new Option("chunksize", true, "Size of chunk in number of blocks when " +
+ "splitting large files into chunks to copy in parallel")),
{code}
Assuming I'm reading the code correctly, the way a file is determined to be
"large" in this context is simply that it has more blocks than the configured
chunk size. The description also seems to imply that there is some separate
configuration option to enable/disable splitting large files. I think better
text would be something like "If set to a positive value, files with more
blocks than this value will be split at their block boundaries during
transfer, and reassembled on the destination cluster. By default, files will
be transmitted in their entirety without splitting."
# Rather than suppressing the checkstyle warnings, recommend implementing the
builder pattern for the {{CopyListingFileStatus}} constructors. That should
make things quite a bit clearer.
# There are a handful of changed lines that appear to be whitespace-only
changes, but that's not a big deal.
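To make point 3 concrete, here is a small sketch of the splitting semantics as
I read them. This is illustrative only, not the patch's actual code: the class
and method names are mine, and I'm assuming the "disabled by default" behavior
described above.

```java
// Illustrative sketch of the assumed -chunksize semantics: a file is "large"
// only when its block count exceeds the configured chunk size, and it is then
// split at block boundaries into ceil(blocks / chunksize) chunks.
// Class and method names here are hypothetical, not from the patch.
public class ChunkSplitSketch {

    // Number of copy chunks for a file with blockCount blocks.
    static long numChunks(long blockCount, long chunkSize) {
        if (chunkSize <= 0 || blockCount <= chunkSize) {
            return 1; // option unset, or file not "large": copy whole file
        }
        return (blockCount + chunkSize - 1) / chunkSize; // ceiling division
    }

    public static void main(String[] args) {
        // 1 TB file with 1 GB blocks = 1024 blocks; chunk size 128 -> 8 chunks
        System.out.println(numChunks(1024, 128)); // prints 8
        // chunk size 0 (the proposed default): no splitting
        System.out.println(numChunks(1024, 0));   // prints 1
    }
}
```

If that matches the intent, the option text should make both halves explicit:
the threshold that makes a file "large" and the unsplit default.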
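On point 4, the builder-pattern shape I have in mind is sketched below. This
is only an illustration: the field names ({{length}}, {{chunkOffset}},
{{chunkLength}}) and types are assumptions standing in for the real
{{CopyListingFileStatus}} fields, not the actual Hadoop API.

```java
// Hypothetical builder sketch for a CopyListingFileStatus-style class.
// Field names are illustrative assumptions. The point is to replace a long
// overloaded-constructor parameter list (which trips checkstyle) with a
// fluent builder that names each field at the call site.
public class FileStatusBuilderSketch {

    // Stand-in for CopyListingFileStatus with the chunk fields the patch adds.
    static final class Status {
        final long length;       // total file length in bytes
        final long chunkOffset;  // start offset of this chunk (hypothetical)
        final long chunkLength;  // length of this chunk (hypothetical)

        private Status(Builder b) {
            this.length = b.length;
            this.chunkOffset = b.chunkOffset;
            this.chunkLength = b.chunkLength;
        }

        static Builder builder() {
            return new Builder();
        }

        static final class Builder {
            private long length;
            private long chunkOffset;
            private long chunkLength;

            Builder length(long v)      { this.length = v;      return this; }
            Builder chunkOffset(long v) { this.chunkOffset = v; return this; }
            Builder chunkLength(long v) { this.chunkLength = v; return this; }

            Status build() { return new Status(this); }
        }
    }

    public static void main(String[] args) {
        // Each call site names the fields it sets instead of passing a long
        // positional argument list to an overloaded constructor.
        Status s = Status.builder()
            .length(1L << 40)       // 1 TB file
            .chunkOffset(0)
            .chunkLength(1L << 30)  // 1 GB chunk
            .build();
        System.out.println(s.chunkLength);
    }
}
```

Besides avoiding the checkstyle suppression, this also means new fields can be
added later without multiplying constructor overloads.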
> distcp can copy blocks in parallel
> ----------------------------------
>
> Key: HADOOP-11794
> URL: https://issues.apache.org/jira/browse/HADOOP-11794
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 0.21.0
> Reporter: dhruba borthakur
> Assignee: Yongjun Zhang
> Attachments: HADOOP-11794.001.patch, HADOOP-11794.002.patch,
> MAPREDUCE-2257.patch
>
>
> The minimum unit of work for a distcp task is a file. We have files that are
> greater than 1 TB with a block size of 1 GB. If we use distcp to copy these
> files, the tasks either take a very long time or ultimately fail. A better
> way for distcp would be to copy all the source blocks in parallel, and then
> stitch the blocks back into files at the destination via the HDFS Concat API
> (HDFS-222)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)