[ 
https://issues.apache.org/jira/browse/HADOOP-13975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated HADOOP-13975:
--------------------------------
    Description: 
Although DistCp allows users to control parallelism via the number of mappers, 
it is sometimes desirable to run fewer mappers with more threads per mapper.  
Because DistCp is network bound (by throughput or, more frequently, by the 
latency of creating connections, opening files, reading/writing files, and 
closing files), this can make each mapper much more efficient.  When the WebHDFS 
protocol is used as either source or target, this multithreaded approach can 
also make HTTP connection reuse (to the NameNode) more efficient.

Threads within one mapper can share many resources, which saves memory and 
connections to the NameNode.


  was:
Although DistCp allows users to control parallelism via the number of mappers, 
it is sometimes desirable to run fewer mappers with more threads per mapper.  
Because DistCp is network bound (by throughput or, more frequently, by the 
latency of creating connections, opening files, reading/writing files, and 
closing files), this can make each mapper much more efficient.

Threads within one mapper can share many resources, which saves memory and 
connections to the NameNode.



> Allow DistCp to use MultiThreadedMapper
> ---------------------------------------
>
>                 Key: HADOOP-13975
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13975
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: tools/distcp
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Minor
>         Attachments: HADOOP-distcp-multithreaded-mapper-branch26.1.patch, 
> HADOOP-distcp-multithreaded-mapper-branch26.2.patch, 
> HADOOP-distcp-multithreaded-mapper-branch26.3.patch, 
> HADOOP-distcp-multithreaded-mapper-branch26.4.patch, 
> HADOOP-distcp-multithreaded-mapper-trunk.1.patch, 
> HADOOP-distcp-multithreaded-mapper-trunk.2.patch, 
> HADOOP-distcp-multithreaded-mapper-trunk.3.patch, 
> HADOOP-distcp-multithreaded-mapper-trunk.4.patch
>
>
> Although DistCp allows users to control parallelism via the number of mappers, 
> it is sometimes desirable to run fewer mappers with more threads per mapper.  
> Because DistCp is network bound (by throughput or, more frequently, by the 
> latency of creating connections, opening files, reading/writing files, and 
> closing files), this can make each mapper much more efficient.  When the WebHDFS 
> protocol is used as either source or target, this multithreaded approach can 
> also make HTTP connection reuse (to the NameNode) more efficient.
> Threads within one mapper can share many resources, which saves memory and 
> connections to the NameNode.
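
The idea of "fewer mappers, more threads per mapper" can be illustrated with a small standalone sketch. It is not DistCp code and not the attached patch: the class ThreadedCopySketch and the method copyAll are hypothetical names, and local file copies stand in for the latency-bound remote operations described above. It only shows several copy tasks sharing one process and one thread pool instead of each needing its own worker.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: not DistCp code and not the attached patch.
final class ThreadedCopySketch {

    // Copies every source file into targetDir using a fixed pool of threads,
    // standing in for "one mapper, many threads".
    // Returns the number of files copied successfully.
    static int copyAll(List<Path> sources, Path targetDir, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Path>> pending = new ArrayList<>();
            for (Path source : sources) {
                // Each file is an independent task; a slow copy does not
                // block the other threads in the pool.
                Callable<Path> task = () -> Files.copy(
                        source,
                        targetDir.resolve(source.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
                pending.add(pool.submit(task));
            }
            int copied = 0;
            for (Future<Path> result : pending) {
                result.get(); // rethrows a copy failure as ExecutionException
                copied++;
            }
            return copied;
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling ThreadedCopySketch.copyAll(files, targetDir, 8) would run up to eight copies concurrently inside a single JVM. In a real job the thread count would presumably be a tunable setting, chosen against the connection and memory limits mentioned in the description.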



