[
https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847403#comment-15847403
]
Mithun Radhakrishnan commented on HADOOP-11794:
-----------------------------------------------
Wow, this is really good work. (I'm continually astonished at how much DistCp
has been improved upon and added to.)
Please forgive me, my DistCp-ese is a little rusty. I have a couple of minor
questions:
# In {{DistCpUtils::toCopyListingFileStatus()}}, the javadoc says it
{{"Converts a list of FileStatus to a list CopyListingFileStatus"}}. The method
does not take a {{List<FileStatus>}}. Shall we remove {{"list of"}}?
# Could we rephrase the doc to {{"Converts a FileStatus to a list of
CopyListingFileStatus. Returns either one CopyListingFileStatus per chunk of
file-blocks (if the file-size exceeds the chunk-size), or a single
CopyListingFileStatus for the entire file (if the file-size is too small to
split)."}}?
# {{DistCpUtils::toCopyListingFileStatus()}} handles heterogeneous block-sizes
via {{DFSClient.getBlockLocations()}}, but only if {{fileStatus.getLen() >
fileStatus.getBlockSize()*chunkSize}}. Is it possible for an HDFS file with
{{fileStatus.getBlockSize() == 256M}} to be composed entirely of tiny blocks
(say 32MB)? Could we have a situation where a splittable file (with small
blocks) ends up unsplit, because {{fileStatus.getBlockSize() >>
effectiveBlockSize}}?
# I wonder if {{chunksize}} might be confused to be the "chunk-length in bytes"
(like {{CopyListingFileStatus.chunkLength}}). I could be wrong, but would
{{blocksPerChunk}} be less ambiguous? (Please ignore if this is too pervasive.)
# Nitpick: {{CopyListingFileStatus.toString()}} uses String concatenation
inside a call to {{StringBuilder.append()}}. (It was that way well before this
patch. :/) Shall we replace this with a chain of {{.append()}} calls?
# In {{CopyCommitter::concatFileChunks()}}, could we please add additional
logging for what files/chunks are being merged?
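To make the concern in point 3 concrete, here is a minimal sketch (not the actual DistCp code; {{wouldSplit}} and its parameters are illustrative names) of a split decision based on the file's *nominal* block size rather than the actual block lengths:

```java
// Hypothetical illustration of point 3: if the split decision compares
// fileLen against nominalBlockSize * blocksPerChunk, a file whose metadata
// reports a large block size can escape splitting even when its actual
// blocks are small. Names are illustrative, not the real DistCp code.
public class SplitDecisionSketch {

    // Mirrors the shape of the condition described above.
    static boolean wouldSplit(long fileLen, long nominalBlockSize,
                              int blocksPerChunk) {
        return fileLen > nominalBlockSize * blocksPerChunk;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 1 GB file whose metadata reports a 256 MB block size...
        long fileLen = 1024 * mb;
        long nominalBlockSize = 256 * mb;
        // ...is not split with blocksPerChunk = 4, even if it actually
        // consists of thirty-two 32 MB blocks and could be split 32 ways.
        System.out.println(wouldSplit(fileLen, nominalBlockSize, 4));
    }
}
```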
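For the nitpick in point 5, the fix would look something like the following sketch (the field names are illustrative, not the actual {{CopyListingFileStatus}} fields):

```java
// Sketch of the point-5 nitpick: string concatenation inside append()
// builds an intermediate String and defeats the purpose of the builder.
// Field names here are illustrative only.
public class ToStringSketch {
    private final long chunkOffset = 0L;
    private final long chunkLength = 134217728L;

    // Before: sb.append("chunkOffset=" + chunkOffset + ", chunkLength=" + chunkLength)
    // After:  a chain of append() calls, one per fragment.
    @Override
    public String toString() {
        return new StringBuilder("CopyListingFileStatus{")
            .append("chunkOffset=").append(chunkOffset)
            .append(", chunkLength=").append(chunkLength)
            .append('}')
            .toString();
    }

    public static void main(String[] args) {
        System.out.println(new ToStringSketch());
    }
}
```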
Thanks so much for working on this, [~yzhangal]. :]
> distcp can copy blocks in parallel
> ----------------------------------
>
> Key: HADOOP-11794
> URL: https://issues.apache.org/jira/browse/HADOOP-11794
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 0.21.0
> Reporter: dhruba borthakur
> Assignee: Yongjun Zhang
> Attachments: HADOOP-11794.001.patch, HADOOP-11794.002.patch,
> HADOOP-11794.003.patch, MAPREDUCE-2257.patch
>
>
> The minimum unit of work for a distcp task is a file. We have files that are
> greater than 1 TB with a block size of 1 GB. If we use distcp to copy these
> files, the tasks either take an extremely long time or eventually fail. A
> better approach for distcp would be to copy all the source blocks in
> parallel, and then stitch the blocks back into files at the destination via
> the HDFS Concat API (HDFS-222)