[
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15485131#comment-15485131
]
Alessio commented on SPARK-5575:
--------------------------------
Pretty strange that an issue with so much hype is still "In Progress" after a
year.
If Apache Spark does not (want to?) include your ANNs, could you consider
releasing them as an independent toolbox?
> Artificial neural networks for MLlib deep learning
> --------------------------------------------------
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
> Issue Type: Umbrella
> Components: MLlib
> Affects Versions: 1.2.0
> Reporter: Alexander Ulanov
>
> *Goal:* Implement various types of artificial neural networks
> *Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
> Having deep learning within Spark's ML library is a question of convenience.
> Spark has broad analytic capabilities and it is useful to have deep learning
> as one of these tools at hand. Deep learning is a model of choice for several
> important modern use-cases, and Spark ML might want to cover them.
> Ultimately, it is hard to explain why we have PCA in ML but do not provide
> an autoencoder. To summarize: Spark should have at least the most widely
> used deep learning models, such as the fully connected artificial neural
> network, the convolutional network and the autoencoder. Advanced and
> experimental deep learning features might reside within packages or as
> pluggable external tools. These three would provide a comprehensive deep
> learning set for Spark ML. We might also include recurrent networks.
> *Requirements:*
> # Extensible API compatible with Spark ML. Basic abstractions such as
> Neuron, Layer, Error, Regularization, Forward and Backpropagation should be
> implemented as traits or interfaces so that they can be easily extended or
> reused. Define the Spark ML API for deep learning. This interface is
> similar to those of the other analytics tools in Spark and supports ML
> pipelines. This makes deep learning easy to use and to plug into analytics
> workloads for Spark users.
> # Efficiency. The current implementation of the multilayer perceptron in
> Spark is less than 2x slower than Caffe, both measured on CPU. The main
> sources of overhead are the JVM and Spark's communication layer. For more
> details, please refer to https://github.com/avulanov/ann-benchmark. That
> said, an efficient implementation of deep learning in Spark should be only
> a few times slower than in a specialized tool. This is reasonable for a
> platform that does much more than deep learning, and I believe the
> community understands this.
> # Scalability. Implement efficient distributed training. This relies
> heavily on efficient communication and scheduling mechanisms. The default
> implementation is based on Spark. More efficient implementations might rely
> on external libraries but must use the same defined interface.
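The extensible-API requirement above can be sketched in plain Scala. The names below (Layer, SigmoidLayer) are illustrative assumptions for this sketch, not Spark's actual internal ANN API:

```scala
// Hypothetical sketch of trait-based layer abstractions; not Spark's API.
trait Layer {
  def forward(input: Array[Double]): Array[Double]
  def backward(input: Array[Double], outputGrad: Array[Double]): Array[Double]
}

class SigmoidLayer extends Layer {
  private def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  // Element-wise forward activation.
  def forward(input: Array[Double]): Array[Double] = input.map(sigmoid)

  // Backpropagation: d sigmoid(x)/dx = sigmoid(x) * (1 - sigmoid(x)),
  // applied element-wise to the incoming gradient.
  def backward(input: Array[Double], outputGrad: Array[Double]): Array[Double] =
    input.zip(outputGrad).map { case (x, g) =>
      val s = sigmoid(x)
      g * s * (1.0 - s)
    }
}
```

New layer types would extend the same trait, which is what makes the design easy to extend and reuse.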
> *Main features:*
> # Multilayer perceptron classifier (MLP)
> # Autoencoder
> # Convolutional neural networks for computer vision. The interface has to
> provide a few architectures for deep learning that are widely used in
> practice, such as AlexNet.
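What the multilayer perceptron classifier computes in its forward pass can be sketched in a few lines of plain Scala, with no Spark dependency. MlpSketch and its method names are hypothetical, not Spark's implementation:

```scala
// Minimal MLP forward-pass sketch; weights are row-major (one row per
// output unit). Illustrative only, not Spark's MLP code.
object MlpSketch {
  // Affine transform: output(i) = sum_j weights(i)(j) * input(j) + bias(i)
  def affine(weights: Array[Array[Double]], bias: Array[Double],
             input: Array[Double]): Array[Double] =
    weights.zip(bias).map { case (row, b) =>
      row.zip(input).map { case (w, x) => w * x }.sum + b
    }

  def sigmoid(v: Array[Double]): Array[Double] =
    v.map(x => 1.0 / (1.0 + math.exp(-x)))

  // One hidden layer followed by an output layer, each with a
  // sigmoid activation.
  def forward(w1: Array[Array[Double]], b1: Array[Double],
              w2: Array[Array[Double]], b2: Array[Double],
              input: Array[Double]): Array[Double] =
    sigmoid(affine(w2, b2, sigmoid(affine(w1, b1, input))))
}
```

Training then amounts to running backpropagation through the same stack of affine/activation steps.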
> *Additional features:*
> # Other architectures, such as the recurrent neural network (RNN), long
> short-term memory (LSTM), the restricted Boltzmann machine (RBM), the deep
> belief network (DBN) and MLP multivariate regression
> # Regularizers, such as L1, L2 and dropout
> # Normalizers
> # Network customization. The internal API of Spark ANN is designed to be
> flexible and can handle different types of layers. However, only a part of
> the API is made public. We have to limit the number of public classes in
> order to make it simpler to support other languages. This forces us to use
> (String or Number) parameters instead of introducing new public classes.
> One option for specifying the architecture of an ANN is a text
> configuration with a layer-wise description. We have considered using the
> Caffe format for this, which offers compatibility with a well-known deep
> learning tool and simplifies the support of other languages in Spark.
> Implementing a parser for a subset of the Caffe format might be the first
> step towards supporting general ANN architectures in Spark.
> # Hardware-specific optimization. One can wrap other deep learning
> implementations with this interface, allowing users to pick a particular
> back-end, e.g. Caffe or TensorFlow, along with the default one. The main
> motivation for using specialized libraries for deep learning is to take
> full advantage of the hardware where Spark runs, in particular GPUs. With
> the default interface in Spark, we would need to wrap only a subset of
> functions from a given specialized library. This does require effort, but
> it is not the same as wrapping all functions. Wrappers can be provided as
> packages without pulling new dependencies into Spark.
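The regularizers listed among the additional features can be illustrated in plain Scala. RegSketch and its parameter names are hypothetical, not a Spark API:

```scala
import scala.util.Random

// Illustrative sketch of L1/L2 penalties and a dropout mask; not Spark code.
object RegSketch {
  // L1 penalty: lambda * sum |w|
  def l1(weights: Array[Double], lambda: Double): Double =
    lambda * weights.map(math.abs).sum

  // L2 penalty: (lambda / 2) * sum w^2
  def l2(weights: Array[Double], lambda: Double): Double =
    lambda * weights.map(w => w * w).sum / 2.0

  // Inverted dropout: zero each activation with probability p and scale
  // survivors by 1/(1-p) so the expected activation is unchanged.
  def dropout(activations: Array[Double], p: Double, rng: Random): Array[Double] =
    activations.map(a => if (rng.nextDouble() < p) 0.0 else a / (1.0 - p))
}
```

The penalty terms would be added to the training loss, while the dropout mask is applied to layer activations during training only.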
> *Completed (merged to the main Spark branch):*
> * Requirements: https://issues.apache.org/jira/browse/SPARK-9471
> ** API
> https://spark-summit.org/eu-2015/events/a-scalable-implementation-of-deep-learning-on-spark/
> ** Efficiency & Scalability: https://github.com/avulanov/ann-benchmark
> * Features:
> ** Multilayer perceptron classifier
> https://issues.apache.org/jira/browse/SPARK-9471
> *In progress (pull request):*
> * Features:
> ** Autoencoder https://issues.apache.org/jira/browse/SPARK-2623
> * Additional features:
> ** MLP regression: https://issues.apache.org/jira/browse/SPARK-10409
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]