[ https://issues.apache.org/jira/browse/TEZ-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544122#comment-14544122 ]

Bikas Saha commented on TEZ-2442:
---------------------------------

These Inputs and Outputs are a combination of logical and physical parts. The 
logical part does the partitioning of the output and the merging of the inputs. 
The physical part currently supports local disk writes and HTTP based fetches. 
There could be multiple ways to do an HTTP based fetch, and TEZ-2450 proposes 
using a different library. Similarly, this jira proposes an HDFS API based 
fetch, as well as a different way to write the data instead of using the local 
disk.

I am wondering if a clean approach would be to clearly separate the logical and 
physical parts. The logical part would be common to all implementations, but 
the physical part could be configured. DAGs would continue to refer to these 
Inputs/Outputs, but the job configuration would determine their actual physical 
parts. I suspect that different projects/organizations will want to use their 
own custom physical pieces based on their own technology, e.g. in this case 
using features in MapR-FS. Clean abstractions here would make such 
customizations much easier. They would also allow custom physical code to be 
plugged in without having to change Tez/Hadoop code or add new dependencies to 
the projects (not that your approach does this, but future customizations 
might). It would also let you use MapR-FS specific libraries directly instead 
of going through the HDFS compatibility abstraction, in case that works better 
for you.
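
To make this concrete, here is a rough sketch of what the split could look 
like. None of this is an existing Tez API; the ShuffleTransport name and the 
"tez.shuffle.transport.class" key are made up purely for illustration.

{code:java}
// Hypothetical sketch only, not existing Tez APIs.
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;

/** Physical half: how partitioned bytes get stored and later fetched. */
interface ShuffleTransport {
  OutputStream writePartition(int partition) throws IOException;
  InputStream fetchPartition(String producerHost, int partition) throws IOException;
}

/** Default: today's behavior -- local disk write, HTTP fetch (bodies elided). */
class LocalDiskHttpTransport implements ShuffleTransport {
  public OutputStream writePartition(int partition) throws IOException {
    throw new UnsupportedOperationException("sketch only");
  }
  public InputStream fetchPartition(String producerHost, int partition) throws IOException {
    throw new UnsupportedOperationException("sketch only");
  }
}

/** The common logical part just looks its physical half up from the job conf. */
class ShuffleTransportFactory {
  static ShuffleTransport create(Configuration conf) {
    Class<? extends ShuffleTransport> clazz = conf.getClass(
        "tez.shuffle.transport.class",      // hypothetical config key
        LocalDiskHttpTransport.class,       // current behavior as the default
        ShuffleTransport.class);
    return ReflectionUtils.newInstance(clazz, conf);
  }
}
{code}

A MapR-FS based implementation would then just be another transport class named 
in the configuration, with no change to the DAG definition.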

Thinking further, even the logical part could be made configurable, because 
there may be multiple ways to merge inputs, and we should be able to choose 
between them without having to change the DAG definition in user code.
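
Again purely as a sketch with made-up names, the merge side could follow the 
same pattern:

{code:java}
// Hypothetical sketch, mirroring the transport example above; names are made up.
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

/** Logical half: how fetched segments are combined for the consumer task. */
interface InputMerger {
  InputStream merge(List<InputStream> fetchedSegments) throws IOException;
}
// An implementation (sort-merge, concatenation, etc.) would be resolved from a
// key such as "tez.shuffle.input.merger.class" via Configuration.getClass()
// and ReflectionUtils.newInstance(), exactly as in the transport sketch.
{code}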

Thoughts?


> Support DFS based shuffle in addition to HTTP shuffle
> -----------------------------------------------------
>
>                 Key: TEZ-2442
>                 URL: https://issues.apache.org/jira/browse/TEZ-2442
>             Project: Apache Tez
>          Issue Type: Improvement
>    Affects Versions: 0.5.3
>            Reporter: Kannan Rajah
>         Attachments: Tez Shuffle using DFS.pdf
>
>
> In Tez, Shuffle is a mechanism by which intermediate data can be shared 
> between stages. Shuffle data is written to local disk and fetched from any 
> remote node using HTTP. A DFS such as the MapR file system can support 
> writing this shuffle data directly to the DFS using its notion of local 
> volumes, and retrieving it from a remote node using the HDFS API. The current 
> Shuffle implementation assumes local data can only be managed by 
> LocalFileSystem, so it uses RawLocalFileSystem and LocalDirAllocator. If we 
> can remove this assumption and introduce an abstraction to manage local 
> disks, then we can reuse most of the shuffle logic (store, sort) and inject 
> an HDFS API based retrieval instead of HTTP.
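
Purely as an illustration of the abstraction the description asks for (the 
class name and configuration keys below are hypothetical, not existing Tez 
code), spill handling could go through a configurable FileSystem instead of 
hard-coding RawLocalFileSystem and LocalDirAllocator:

{code:java}
// Hedged sketch only: names and config keys are made up for illustration.
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class IntermediateStore {
  private final FileSystem fs;
  private final Path baseDir;

  IntermediateStore(Configuration conf) throws IOException {
    // Default to the local file system (current behavior); a deployment on a
    // DFS such as MapR-FS could point this at its own scheme instead.
    String uri = conf.get("tez.shuffle.intermediate.fs", "file:///");         // hypothetical key
    this.fs = FileSystem.get(URI.create(uri), conf);
    this.baseDir = new Path(conf.get("tez.shuffle.intermediate.dir", "/tmp/tez-shuffle"));
  }

  FSDataOutputStream createSpill(String taskId, int partition) throws IOException {
    return fs.create(new Path(baseDir, taskId + "/part-" + partition));
  }

  FSDataInputStream openSpill(String taskId, int partition) throws IOException {
    // With a DFS, a remote consumer could read this path directly via the
    // HDFS API instead of fetching it over HTTP.
    return fs.open(new Path(baseDir, taskId + "/part-" + partition));
  }
}
{code}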


