[
https://issues.apache.org/jira/browse/HADOOP-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15952131#comment-15952131
]
Manu Zhang commented on HADOOP-13944:
-------------------------------------
Hi all,
Here is an umbrella GitHub repo covering the overall architecture,
considerations and rationale.
It also contains links to a group of sub-projects, each of which supports
a deep learning engine on YARN (e.g. TensorFlowOnYARN, MXNetOnYARN):
[https://github.com/Intel-bigdata/HDL]
> [HDL] Support Deep Learning on Hadoop
> -------------------------------------
>
> Key: HADOOP-13944
> URL: https://issues.apache.org/jira/browse/HADOOP-13944
> Project: Hadoop Common
> Issue Type: New Feature
> Reporter: Kai Zheng
>
> Big data empowers Deep Learning (DL), and Hadoop is a natural platform to
> support this new kind of computation, given its enormous data (HDFS) and
> vast CPU resources (YARN). Supporting Deep Learning in the Hadoop platform
> layer has particular advantages: it would be much easier to achieve the
> desired data affinity and hardware-specific scheduling, and it would also be
> flexible enough to support the computing and user-facing frameworks above
> it, such as Spark, Hive, Flink and Streams.
> We’d like to propose evolving Hadoop further to embrace Deep Learning and
> to provide the fundamental infrastructure for this new kind of computing.
> Briefly, the goals would be:
> * A new layer in Hadoop for launching, distributing and executing Deep
> Learning workloads, as is done for MapReduce;
> * A framework in the new layer to leverage and support existing Deep Learning
> engines such as TensorFlow, Caffe/Intel-Caffe, MXNet, Nervana, etc.;
> * Extend and enhance YARN to support the desired scheduling capabilities,
> as already raised in the community, for FPGA, GPU, etc.;
> * Optimize HDFS storage and provide desired data formats for Deep Learning;
> * Tools and libraries to submit and manage DL jobs, and the necessary web UIs
> for monitoring and troubleshooting;
> * Optionally, for the long term, a common Deep Learning domain representation
> for users to define DL jobs independent of concrete DL engines.
> Out of scope: a new Deep Learning engine. We leverage and support existing
> DL engines, and also allow users to plug in their own.
> The rationale:
> * Deep Learning is data and I/O heavy. Related advantages of HDFS and Hadoop:
> vast data to learn from, either already stored or easy to load; data
> locality, still desired in DL; tiered storage support, to use faster devices
> like NVMe SSD, 3D XPoint and persistent memory; cache support, to use large
> memory for hot or repeatedly accessed data; even Ozone, the KV store for
> large numbers of small objects, with the desired API; and cloud support.
> * Deep Learning is compute heavy. Related advantages of YARN: flexibility, to
> support complex computing frameworks and applications; hardware capability
> awareness, to schedule and distribute work accordingly (think FPGA, GPU and
> RDMA); large scale, with proven scalability to thousands of nodes; useful
> facilities such as the timeline service and rich interfaces (command line,
> REST and web).
> * As a common, low-level facility layer, it is easier to optimize at the
> bottom, yet powerful enough to support the frameworks above it, such as
> Spark, Flink, Hive and Streams. There is no need to hack everywhere, only in
> one central, common layer.
> * Security, enterprise and distribution. A mature ecosystem for Deep Learning
> to build upon.
> This is based on our survey and some preliminary work such as TensorFlow on
> YARN (we will document and discuss it separately under this umbrella). We
> welcome your feedback and thoughts. Once aligned, we’d like to contribute
> our work to the Hadoop project space (maybe as a new module like
> hadoop-deeplearning, similar to the cloud support, in a separate branch),
> since from our point of view the work can benefit more Hadoop users there
> than in just a GitHub repo.
> Filing this unassigned, as it’s a team effort for now and, hopefully, will
> become a community effort.