[ 
https://issues.apache.org/jira/browse/TEZ-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544535#comment-14544535
 ] 

Bikas Saha commented on TEZ-2442:
---------------------------------

bq. On both the producer and consumer end - there may be intermediate data 
being written out to disk. For the producer, this may be intermediate spills, 
which don't need to be transferred to the reducer and instead will be merged 
locally. For the consumer - there's data which is fetched, but there may be 
intermediate spills while this data is being merged together. I don't think the 
intermediate spills need to go through HDFS.
I think that is what Kannan wants to do though, per his previous comments: 
the intermediate spills also go to maprfs.

> Support DFS based shuffle in addition to HTTP shuffle
> -----------------------------------------------------
>
>                 Key: TEZ-2442
>                 URL: https://issues.apache.org/jira/browse/TEZ-2442
>             Project: Apache Tez
>          Issue Type: Improvement
>    Affects Versions: 0.5.3
>            Reporter: Kannan Rajah
>            Assignee: Kannan Rajah
>         Attachments: Tez Shuffle using DFS.pdf
>
>
> In Tez, shuffle is the mechanism by which intermediate data is shared 
> between stages. Shuffle data is written to local disk and fetched from any 
> remote node over HTTP. A DFS like the MapR file system can support writing 
> this shuffle data directly to the DFS using its notion of local volumes, and 
> retrieving it from a remote node through the HDFS API. The current shuffle 
> implementation assumes local data can only be managed by LocalFileSystem, so 
> it uses RawLocalFileSystem and LocalDirAllocator. If we remove this 
> assumption and introduce an abstraction to manage local disks, we can reuse 
> most of the shuffle logic (store, sort) and inject an HDFS-API-based 
> retrieval instead of HTTP.
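A minimal sketch of the abstraction the description proposes. All names here are hypothetical illustrations, not the actual Tez or Hadoop API: shuffle writes and fetches go through a storage interface, so the existing store/sort logic can target either local disk (as RawLocalFileSystem/LocalDirAllocator do today) or a DFS-backed implementation reached via the HDFS API instead of HTTP.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical abstraction over where shuffle partitions live.
// The producer-side store/sort code writes through it; the consumer-side
// fetch code reads through it, without knowing whether the backing store
// is local disk or a DFS.
interface ShuffleStorage {
    Path writePartition(String taskId, int partition, byte[] data) throws IOException;
    byte[] fetchPartition(String taskId, int partition) throws IOException;
}

// Local-disk implementation, standing in for today's
// RawLocalFileSystem + LocalDirAllocator code path. A DFS-backed
// implementation would implement the same interface using the HDFS
// FileSystem API against a path on the distributed file system.
class LocalShuffleStorage implements ShuffleStorage {
    private final Path root;

    LocalShuffleStorage(Path root) {
        this.root = root;
    }

    @Override
    public Path writePartition(String taskId, int partition, byte[] data) throws IOException {
        Path dir = root.resolve(taskId);
        Files.createDirectories(dir);
        Path file = dir.resolve("part-" + partition);
        Files.write(file, data);
        return file;
    }

    @Override
    public byte[] fetchPartition(String taskId, int partition) throws IOException {
        // In the DFS variant this read happens from a remote node via the
        // HDFS API, replacing the HTTP fetch.
        return Files.readAllBytes(root.resolve(taskId).resolve("part-" + partition));
    }
}
```

With such an interface in place, swapping HTTP shuffle for DFS shuffle becomes a matter of injecting a different ShuffleStorage implementation rather than rewriting the store/sort logic.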



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
