[
https://issues.apache.org/jira/browse/TEZ-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540463#comment-14540463
]
Kannan Rajah commented on TEZ-2442:
-----------------------------------
Thanks. Here is a link to my repo with commits (top 5) needed to get it working
on 0.5.3. This is not intended for a pull request. I would like to gather
feedback on the approach.
https://github.com/rkannan82/tez/commits/branch-0.5.3
The first commit defines the abstraction. In order to handle the backward
compatibility case, I had actually implemented it inside Tez code base itself
with same hadoop.fs package name. I also provided a TezLocalDirAllocator
abstraction. Since this code belongs to hadoop-common, I filed JIRA-11905. I
would like to get some inputs on how we can go forward.
https://github.com/rkannan82/tez/commit/27355844d303561a2db4174e1666646523c252fa
Then changes to existing Tez files are fairly trivial.
Before
LocalDirAllocator allocator = new LocalDirAllocator()
After
LocalDiskPathAllocator allocator = LocalDiskUtil.getPathAllocator()
Before
RawLocalFileSystem fs = new FileSystem.getLocal()
After
FileSystem fs = LocalDiskUtil.getFileSystem()
> Support DFS based shuffle in addition to HTTP shuffle
> -----------------------------------------------------
>
> Key: TEZ-2442
> URL: https://issues.apache.org/jira/browse/TEZ-2442
> Project: Apache Tez
> Issue Type: Improvement
> Affects Versions: 0.5.3
> Reporter: Kannan Rajah
> Attachments: Tez Shuffle using DFS.pdf
>
>
> In Tez, Shuffle is a mechanism by which intermediate data can be shared
> between stages. Shuffle data is written to local disk and fetched from any
> remote node using HTTP. A DFS like MapR file system can support writing this
> shuffle data directly to its DFS using a notion of local volumes and retrieve
> it using HDFS API from remote node. The current Shuffle implementation
> assumes local data can only be managed by LocalFileSystem. So it uses
> RawLocalFileSystem and LocalDirAllocator. If we can remove this assumption
> and introduce an abstraction to manage local disks, then we can reuse most of
> the shuffle logic (store, sort) and inject a HDFS API based retrieval instead
> of HTTP.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)