[jira] [Commented] (TEZ-2442) Support DFS based shuffle in addition to HTTP shuffle

Kannan Rajah (JIRA) Tue, 12 May 2015 11:51:45 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540463#comment-14540463
 ]


Kannan Rajah commented on TEZ-2442:
-----------------------------------

Thanks. Here is a link to my repo with commits (top 5) needed to get it working 
on 0.5.3. This is not intended for a pull request. I would like to gather 
feedback on the approach.
https://github.com/rkannan82/tez/commits/branch-0.5.3

The first commit defines the abstraction. In order to handle the backward 
compatibility case, I had actually implemented it inside Tez code base itself 
with same hadoop.fs package name. I also provided a TezLocalDirAllocator 
abstraction. Since this code belongs to hadoop-common, I filed JIRA-11905. I 
would like to get some inputs on how we can go forward.
https://github.com/rkannan82/tez/commit/27355844d303561a2db4174e1666646523c252fa

Then changes to existing Tez files are fairly trivial.
Before
LocalDirAllocator allocator = new LocalDirAllocator()
After
LocalDiskPathAllocator allocator = LocalDiskUtil.getPathAllocator()

Before
RawLocalFileSystem fs = new FileSystem.getLocal()
After
FileSystem fs = LocalDiskUtil.getFileSystem()

> Support DFS based shuffle in addition to HTTP shuffle
> -----------------------------------------------------
>
>                 Key: TEZ-2442
>                 URL: https://issues.apache.org/jira/browse/TEZ-2442
>             Project: Apache Tez
>          Issue Type: Improvement
>    Affects Versions: 0.5.3
>            Reporter: Kannan Rajah
>         Attachments: Tez Shuffle using DFS.pdf
>
>
> In Tez, Shuffle is a mechanism by which intermediate data can be shared 
> between stages. Shuffle data is written to local disk and fetched from any 
> remote node using HTTP. A DFS like MapR file system can support writing this 
> shuffle data directly to its DFS using a notion of local volumes and retrieve 
> it using HDFS API from remote node. The current Shuffle implementation 
> assumes local data can only be managed by LocalFileSystem. So it uses 
> RawLocalFileSystem and LocalDirAllocator. If we can remove this assumption 
> and introduce an abstraction to manage local disks, then we can reuse most of 
> the shuffle logic (store, sort) and inject a HDFS API based retrieval instead 
> of HTTP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (TEZ-2442) Support DFS based shuffle in addition to HTTP shuffle

Reply via email to