[jira] [Commented] (TEZ-2442) Support DFS based shuffle in addition to HTTP shuffle

Kannan Rajah (JIRA) Thu, 14 May 2015 11:24:58 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544148#comment-14544148
 ]


Kannan Rajah commented on TEZ-2442:
-----------------------------------

Separating the logical and physical parts makes sense. But even after you do 
that, you would still need this local disk abstraction at the physical layer, 
right? All the new abstraction does is determine which FileSystem/service can 
be used to read from and write to local disks. So I can imagine the following 
cases:

Write                                Read
-----------------------------------------------------------------------------------
LocalFileSystem/LocalDirAllocator    HTTP shuffle/LocalDirAllocator on remote node
LocalFileSystem/LocalDirAllocator    Async shuffle/LocalDirAllocator on remote node
FileSystem/DFSLocalDirAllocator      FileSystem/DFSLocalDirAllocator
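
To make the three cases concrete, here is a rough sketch of how the write side 
could constrain the read side. This is only an illustration: the class, enum, 
and method names (ShuffleStorageSketch, WriteSide, ReadSide, readersFor, 
isSupported) are hypothetical and are not part of Tez or Hadoop.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of the write/read pairings in the table above.
// None of these names exist in Tez or Hadoop; they are for illustration only.
public class ShuffleStorageSketch {

    // How the producer writes shuffle output.
    enum WriteSide {
        LOCAL_FILE_SYSTEM, // LocalFileSystem + LocalDirAllocator
        DFS                // FileSystem + DFSLocalDirAllocator
    }

    // How the consumer retrieves shuffle output.
    enum ReadSide {
        HTTP_SHUFFLE,      // HTTP fetch; LocalDirAllocator on the remote node
        ASYNC_SHUFFLE,     // async fetch; LocalDirAllocator on the remote node
        DFS_FILE_SYSTEM    // FileSystem + DFSLocalDirAllocator
    }

    // Returns the read paths that can serve data written via the given path.
    static Set<ReadSide> readersFor(WriteSide write) {
        switch (write) {
            case LOCAL_FILE_SYSTEM:
                return EnumSet.of(ReadSide.HTTP_SHUFFLE, ReadSide.ASYNC_SHUFFLE);
            case DFS:
                return EnumSet.of(ReadSide.DFS_FILE_SYSTEM);
            default:
                throw new IllegalArgumentException("Unknown write side: " + write);
        }
    }

    // True if the (write, read) pair is one of the rows in the table.
    static boolean isSupported(WriteSide write, ReadSide read) {
        return readersFor(write).contains(read);
    }

    public static void main(String[] args) {
        for (WriteSide write : WriteSide.values()) {
            System.out.println(write + " -> " + readersFor(write));
        }
    }
}
```

The point of the sketch is that the choice made at write time is what decides 
which retrieval mechanisms are valid, which is why the abstraction seems to 
belong at the physical layer.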

> Support DFS based shuffle in addition to HTTP shuffle
> -----------------------------------------------------
>
>                 Key: TEZ-2442
>                 URL: https://issues.apache.org/jira/browse/TEZ-2442
>             Project: Apache Tez
>          Issue Type: Improvement
>    Affects Versions: 0.5.3
>            Reporter: Kannan Rajah
>         Attachments: Tez Shuffle using DFS.pdf
>
>
> In Tez, Shuffle is a mechanism by which intermediate data can be shared 
> between stages. Shuffle data is written to local disk and fetched from any 
> remote node using HTTP. A DFS like the MapR file system can support writing 
> this shuffle data directly to its DFS using a notion of local volumes and 
> retrieving it using the HDFS API from a remote node. The current Shuffle 
> implementation assumes local data can only be managed by LocalFileSystem, so 
> it uses RawLocalFileSystem and LocalDirAllocator. If we can remove this 
> assumption and introduce an abstraction to manage local disks, then we can 
> reuse most of the shuffle logic (store, sort) and inject an HDFS API based 
> retrieval instead of HTTP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
