[ 
https://issues.apache.org/jira/browse/HADOOP-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15952131#comment-15952131
 ] 

Manu Zhang commented on HADOOP-13944:
-------------------------------------

Hi all,

Here is an umbrella GitHub repo for the overall architecture, considerations 
and rationale.
It also contains links to a group of sub-projects, each of which supports a 
deep learning engine on YARN (e.g. TensorFlowOnYARN, MXNetOnYARN).

[https://github.com/Intel-bigdata/HDL]

> [HDL] Support Deep Learning on Hadoop
> -------------------------------------
>
>                 Key: HADOOP-13944
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13944
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Kai Zheng
>
> Big data empowers Deep Learning (DL), and Hadoop is a natural platform to 
> support this new kind of computation, given its enormous data (HDFS) and 
> vast CPU resources (YARN). Supporting Deep Learning at the Hadoop platform 
> layer has particular advantages: it would be much easier to achieve the 
> desired data affinity and hardware-specific scheduling, and it would remain 
> flexible enough to support the computing and user-facing frameworks above 
> it, such as Spark, Hive, Flink and Streams.
> We’d like to propose evolving Hadoop to further embrace Deep Learning and 
> provide the fundamental infrastructure to support this new kind of 
> computing. Briefly, the goals would be:
> * A new layer in Hadoop for launching, distributing and executing Deep 
> Learning workloads, analogous to what exists for MapReduce;
> * A framework in the new layer to leverage and support existing Deep Learning 
> engines such as TensorFlow, Caffe/Intel-Caffe, MXNet, Nervana, etc.;
> * Extend and enhance YARN to support the desired scheduling capabilities, as 
> already raised in the community, for FPGAs, GPUs, etc. (see the sketch after 
> this list);
> * Optimize HDFS storage and provide desired data formats for Deep Learning;
> * Tools and libraries to submit and manage DL jobs, plus the necessary web 
> UIs for monitoring and troubleshooting;
> * Optionally, for the long term, a common Deep Learning domain representation 
> for users to define DL jobs independently of concrete DL engines. 
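> As an illustration of the scheduling goal above, here is a minimal sketch 
> (not an existing API guarantee) of what a DL worker's container request 
> might look like once YARN's resource model is extended with a named GPU 
> resource type; the "yarn.io/gpu" name and the sizes are assumptions for 
> illustration only.
> {code:java}
> // Sketch only: assumes YARN exposes a countable GPU resource type.
> import org.apache.hadoop.yarn.api.records.Priority;
> import org.apache.hadoop.yarn.api.records.Resource;
> import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
>
> public class DlContainerRequestSketch {
>   public static ContainerRequest buildWorkerRequest() {
>     // One DL worker: 8 GB of memory, 4 vcores, plus 2 units of the
>     // (assumed) GPU resource type.
>     Resource capability = Resource.newInstance(8192, 4);
>     capability.setResourceValue("yarn.io/gpu", 2); // hypothetical resource name
>     Priority priority = Priority.newInstance(0);
>     // nodes/racks left null here; data locality would be expressed by
>     // passing preferred hosts once the input data's locations are known.
>     return new ContainerRequest(capability, null, null, priority);
>   }
> }
> {code}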
> Out of scope: a new Deep Learning engine. We leverage and support existing DL 
> engines, while also allowing users to hook in their own.
> The rationale:
> * Deep Learning is data- and IO-heavy; the related advantages in HDFS and 
> Hadoop: vast data to learn from, already in place or easy to load in; data 
> locality, still desired in DL (see the sketch after this list); tiered 
> storage support, to use faster devices like NVMe SSDs, 3D XPoint and 
> persistent memory; cache support, to keep hot or repeatedly accessed data in 
> large memory; even Ozone, the KV store for large numbers of small objects 
> with the desired API; and cloud support.
> * Deep Learning is compute-heavy; the related advantages in YARN: flexible, 
> to support complex computing frameworks and applications; hardware-capability 
> aware, scheduling and distributing accordingly, with FPGAs, GPUs and RDMA in 
> mind; large scale, with proven scalability to thousands of nodes; and nice 
> facilities such as the timeline service and rich interfaces (command line, 
> REST and web).
> * As a common, low-level facility layer, it is easier to optimize at the 
> bottom, yet powerful enough to support the frameworks above it, such as 
> Spark, Flink, Hive and Streams. There is no need to hack everywhere; the 
> work happens in one central place and common layer.
> * Security, enterprise readiness and distribution: a mature ecosystem for 
> Deep Learning to build upon.
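> To make the data-locality point above concrete, here is a minimal sketch of 
> how a DL application master could discover where a training file's HDFS 
> blocks live and then prefer those hosts in its container requests; the path 
> is hypothetical and only standard FileSystem calls are used.
> {code:java}
> import java.util.Arrays;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.BlockLocation;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class TrainingDataLocality {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     Path data = new Path("/user/dl/dataset/part-00000"); // hypothetical path
>     FileStatus status = fs.getFileStatus(data);
>     // Each BlockLocation lists the datanodes holding one block; an AM could
>     // pass these hosts as preferred nodes when requesting containers.
>     for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
>       System.out.println(Arrays.toString(b.getHosts()));
>     }
>   }
> }
> {code}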
> This is based on our survey and some preliminary work like TensorFlow on YARN 
> (which we will document and discuss separately under this umbrella). We 
> welcome your feedback and valuable thoughts. Once aligned, we'd like to 
> contribute our work into the Hadoop project space (perhaps a new module like 
> hadoop-deeplearning, similar to the cloud supports, in a separate branch), 
> since from our point of view the work can benefit more Hadoop users there 
> than in just a GitHub repo.
> Filing this unassigned, as it's team work for now and, hopefully, will 
> become a community effort.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
