[
https://issues.apache.org/jira/browse/TEZ-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962723#comment-16962723
]
Ganesha Shreedhara commented on TEZ-2442:
-----------------------------------------
Hi,
Is there a reason why work on this stopped? Persisting shuffle data in remote
storage would be a nice feature, since it would avoid some of the shortcomings
of the current implementation.
Are we fine with having an abstraction for file-system-based storage only
(HDFS, NFS, LustreFS, or any cloud-based file storage)? Or do we want to make
it more generic, similar to the ongoing work in Spark
(https://issues.apache.org/jira/browse/SPARK-25299)? I am interested in
contributing to this work.
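To make the file-system-only option concrete, here is a minimal sketch of what
such an abstraction could look like. Every name in it (ShuffleStore,
FileSystemShuffleStore, outputId) is hypothetical and not an existing Tez API,
and it uses java.nio rather than the Hadoop FileSystem API only so that it
stays self-contained. A real version would presumably sit on top of
org.apache.hadoop.fs.FileSystem.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical abstraction over where shuffle output is stored.
// This is an illustration, not an interface that exists in Tez.
interface ShuffleStore {
    // Opens a stream for writing the shuffle output identified by outputId.
    OutputStream create(String outputId) throws IOException;

    // Opens a stream for reading shuffle output written earlier.
    InputStream open(String outputId) throws IOException;
}

// File-system-backed variant. The root is assumed to be a directory that both
// producer and consumer can reach, for example a shared mount.
final class FileSystemShuffleStore implements ShuffleStore {
    private final Path root;

    FileSystemShuffleStore(Path root) {
        this.root = root;
    }

    // Maps an output id to a file under the root directory.
    private Path resolve(String outputId) {
        return root.resolve(outputId + ".out");
    }

    @Override
    public OutputStream create(String outputId) throws IOException {
        Path target = resolve(outputId);
        Files.createDirectories(target.getParent());
        return Files.newOutputStream(target);
    }

    @Override
    public InputStream open(String outputId) throws IOException {
        return Files.newInputStream(resolve(outputId));
    }
}
```

With this shape, the consumer side would call open() on the shared store
instead of issuing an HTTP fetch. A more generic design, closer to the Spark
work, would need an interface that does not assume path-addressable files.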
> Support DFS based shuffle in addition to HTTP shuffle
> -----------------------------------------------------
>
> Key: TEZ-2442
> URL: https://issues.apache.org/jira/browse/TEZ-2442
> Project: Apache Tez
> Issue Type: Improvement
> Affects Versions: 0.5.3
> Reporter: Kannan Rajah
> Assignee: shanyu zhao
> Priority: Major
> Attachments: FS_based_shuffle_v2.pdf, Tez Shuffle using DFS.pdf,
> hdfs_broadcast_hack.txt, tez-2442-trunk.2.patch, tez-2442-trunk.3.patch,
> tez-2442-trunk.4.patch, tez-2442-trunk.5.patch, tez-2442-trunk.patch,
> tez_hdfs_shuffle.patch
>
>
> In Tez, Shuffle is a mechanism by which intermediate data can be shared
> between stages. Shuffle data is written to local disk and fetched from any
> remote node using HTTP. A DFS such as the MapR file system can write this
> shuffle data directly to its DFS using a notion of local volumes and retrieve
> it from a remote node using the HDFS API. The current Shuffle implementation
> assumes local data can only be managed by LocalFileSystem, so it uses
> RawLocalFileSystem and LocalDirAllocator. If we can remove this assumption
> and introduce an abstraction to manage local disks, then we can reuse most of
> the shuffle logic (store, sort) and inject an HDFS-API-based retrieval instead
> of HTTP.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)