[ https://issues.apache.org/jira/browse/TEZ-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544195#comment-14544195 ]
Kannan Rajah commented on TEZ-2442:
-----------------------------------
If logical operations such as spill and merge use LocalDirAllocator, then we
want to reuse all of that logic, because it still applies to HDFS-based
shuffle. I would like to clarify one thing. Your proposal seems to accommodate
new use cases that might come up in the future, which is a good thing. But this
JIRA is only about reusing the existing shuffle logic in HDFS mode. In
supporting a more generic use case, I don't want to lose the ability to reuse
that code. Does that make sense?
> Support DFS based shuffle in addition to HTTP shuffle
> -----------------------------------------------------
>
> Key: TEZ-2442
> URL: https://issues.apache.org/jira/browse/TEZ-2442
> Project: Apache Tez
> Issue Type: Improvement
> Affects Versions: 0.5.3
> Reporter: Kannan Rajah
> Attachments: Tez Shuffle using DFS.pdf
>
>
> In Tez, Shuffle is the mechanism by which intermediate data is shared
> between stages. Shuffle data is written to local disk and fetched from any
> remote node using HTTP. A DFS such as the MapR file system can support
> writing this shuffle data directly to the DFS using a notion of local
> volumes, and retrieving it from a remote node using the HDFS API. The
> current Shuffle implementation assumes local data can only be managed by
> LocalFileSystem, so it uses RawLocalFileSystem and LocalDirAllocator. If we
> can remove this assumption and introduce an abstraction to manage local
> disks, then we can reuse most of the shuffle logic (store, sort) and inject
> HDFS API-based retrieval instead of HTTP.
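
A minimal sketch of the abstraction described above, in plain Java with no
Tez or Hadoop dependencies. The names ShuffleOutputAllocator, ShuffleFetcher,
SharedVolumeStore and Spiller are hypothetical and are not Tez or Hadoop APIs.
A temporary directory stands in for a DFS local volume, and java.nio.file
stands in for the HDFS API. It only illustrates the idea that the spill logic
depends on an allocator interface rather than on the local file system, so the
retrieval side can be swapped without touching it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical: decides where a task writes its shuffle output. */
interface ShuffleOutputAllocator {
    Path pathForWrite(String relativeName) throws IOException;
}

/** Hypothetical: how a consumer retrieves a producer's shuffle output. */
interface ShuffleFetcher {
    byte[] fetch(String relativeName) throws IOException;
}

/**
 * Stand-in for a DFS local volume: the writer and the remote reader
 * both address files under the same root, so no HTTP fetch is needed.
 */
final class SharedVolumeStore implements ShuffleOutputAllocator, ShuffleFetcher {
    private final Path root;

    SharedVolumeStore(Path root) {
        this.root = root;
    }

    @Override
    public Path pathForWrite(String relativeName) throws IOException {
        Path target = root.resolve(relativeName);
        Files.createDirectories(target.getParent());
        return target;
    }

    @Override
    public byte[] fetch(String relativeName) throws IOException {
        return Files.readAllBytes(root.resolve(relativeName));
    }
}

/** Shuffle-side logic (reduced to a trivial spill) sees only the allocator. */
final class Spiller {
    private final ShuffleOutputAllocator allocator;

    Spiller(ShuffleOutputAllocator allocator) {
        this.allocator = allocator;
    }

    Path spill(String name, String sortedRecords) throws IOException {
        Path out = allocator.pathForWrite(name);
        Files.write(out, sortedRecords.getBytes(StandardCharsets.UTF_8));
        return out;
    }
}

final class DfsShuffleSketch {
    public static void main(String[] args) throws IOException {
        // Assumption: a temp directory plays the role of the shared volume.
        Path volume = Files.createTempDirectory("shuffle-volume");
        SharedVolumeStore store = new SharedVolumeStore(volume);

        // Producer side: spill through the allocator abstraction.
        new Spiller(store).spill("task_0/spill_0.out", "a\t1\nb\t2\n");

        // Consumer side: read back through the fetcher abstraction.
        byte[] fetched = store.fetch("task_0/spill_0.out");
        System.out.print(new String(fetched, StandardCharsets.UTF_8));
    }
}
```

An HTTP-based ShuffleFetcher could implement the same interface, which is the
sense in which the existing spill and merge code would stay reusable across
both modes.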