[
https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847403#comment-15847403
]
Mithun Radhakrishnan commented on HADOOP-11794:
-----------------------------------------------
Wow, this is really good work. (I'm continually astonished at how much DistCp
has been improved upon and added to.)
Please forgive me, my DistCp-ese is a little rusty. I have a couple of minor
questions:
# In {{DistCpUtils::toCopyListingFileStatus()}}, the javadoc says it
{{"Converts a list of FileStatus to a list CopyListingFileStatus"}}. The method
does not take a {{List<FileStatus>}}. Shall we remove {{"list of"}}?
# Could we rephrase the doc to {{"Converts a FileStatus to a list of
CopyListingFileStatus. Returns either one CopyListingFileStatus per chunk of
file-blocks (if the file-size exceeds the chunk-size), or a single
CopyListingFileStatus for the entire file (if the file-size is too small to
split)."}}?
# {{DistCpUtils::toCopyListingFileStatus()}} handles heterogeneous block-sizes
via {{DFSClient.getBlockLocations()}}, but only if {{fileStatus.getLen() >
fileStatus.getBlockSize()*chunkSize}}. Is it possible for an HDFS file with
{{fileStatus.getBlockSize() == 256M}} to be composed entirely of tiny blocks
(say 32MB)? Could we have a situation where a splittable file (with small
blocks) ends up unsplit, because {{fileStatus.getBlockSize() >>
effectiveBlockSize}}?
# I wonder if {{chunksize}} might be confused to be the "chunk-length in bytes"
(like {{CopyListingFileStatus.chunkLength}}). I could be wrong, but would
{{blocksPerChunk}} be less ambiguous? (Please ignore if this is too pervasive.)
# Nitpick: {{CopyListingFileStatus.toString()}} uses String concatenation
inside a call to {{StringBuilder.append()}}. (It was that way well before this
patch. :/) Shall we replace this with a chain of {{.append()}} calls?
# In {{CopyCommitter::concatFileChunks()}}, could we please add additional
logging for what files/chunks are being merged?
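To make the concern in point 3 concrete, here is a minimal sketch (not the actual DistCp code; {{wouldSplit}} and its parameters are illustrative names) of a split decision based on the file's *nominal* block size rather than the actual block lengths:

```java
// Hypothetical illustration of point 3: if the split decision compares
// fileLen against nominalBlockSize * blocksPerChunk, a file whose metadata
// reports a large block size can escape splitting even when its actual
// blocks are small. Names are illustrative, not the real DistCp code.
public class SplitDecisionSketch {

    // Mirrors the shape of the condition described above.
    static boolean wouldSplit(long fileLen, long nominalBlockSize,
                              int blocksPerChunk) {
        return fileLen > nominalBlockSize * blocksPerChunk;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 1 GB file whose metadata reports a 256 MB block size...
        long fileLen = 1024 * mb;
        long nominalBlockSize = 256 * mb;
        // ...is not split with blocksPerChunk = 4, even if it actually
        // consists of thirty-two 32 MB blocks and could be split 32 ways.
        System.out.println(wouldSplit(fileLen, nominalBlockSize, 4));
    }
}
```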
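For the nitpick in point 5, the fix would look something like the following sketch (the field names are illustrative, not the actual {{CopyListingFileStatus}} fields):

```java
// Sketch of the point-5 nitpick: string concatenation inside append()
// builds an intermediate String and defeats the purpose of the builder.
// Field names here are illustrative only.
public class ToStringSketch {
    private final long chunkOffset = 0L;
    private final long chunkLength = 134217728L;

    // Before: sb.append("chunkOffset=" + chunkOffset + ", chunkLength=" + chunkLength)
    // After:  a chain of append() calls, one per fragment.
    @Override
    public String toString() {
        return new StringBuilder("CopyListingFileStatus{")
            .append("chunkOffset=").append(chunkOffset)
            .append(", chunkLength=").append(chunkLength)
            .append('}')
            .toString();
    }

    public static void main(String[] args) {
        System.out.println(new ToStringSketch());
    }
}
```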
Thanks so much for working on this, [~yzhangal]. :]
> distcp can copy blocks in parallel
> ----------------------------------
>
> Key: HADOOP-11794
> URL: https://issues.apache.org/jira/browse/HADOOP-11794
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 0.21.0
> Reporter: dhruba borthakur
> Assignee: Yongjun Zhang
> Attachments: HADOOP-11794.001.patch, HADOOP-11794.002.patch,
> HADOOP-11794.003.patch, MAPREDUCE-2257.patch
>
>
> The minimum unit of work for a distcp task is a file. We have files that are
> greater than 1 TB with a block size of 1 GB. If we use distcp to copy these
> files, the tasks either take an extremely long time or eventually fail. A
> better approach for distcp would be to copy all the source blocks in
> parallel, and then stitch the blocks back into files at the destination via
> the HDFS Concat API (HDFS-222)