[ 
https://issues.apache.org/jira/browse/MAHOUT-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911865#comment-13911865
 ] 

Yexi Jiang commented on MAHOUT-1426:
------------------------------------

I totally agree with you. From the algorithmic perspective, RBMs and Autoencoders 
have proven to be very effective for feature learning. When training a multi-layer 
neural network, it is usually necessary to stack RBMs or Autoencoders to learn 
the representative features first.

1. If the training dataset is large.
It is true that if the training data is huge, the online version will be slow, as 
it is not a parallel implementation. If we implement the algorithm in a MapReduce 
way, the data can be read in parallel. No matter whether we use stochastic gradient 
descent, mini-batch gradient descent, or full-batch gradient descent, we need 
to train the model over many iterations. In practice, we need one job for each 
iteration. It is well known that Hadoop's job start-up time is significant, so 
the overhead can be even higher than the actual computation time. For example, 
if we use stochastic gradient descent, after each partition reads one data 
instance, we need to update and synchronize the model. IMHO, BSP is more 
effective than MapReduce in such a scenario.
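To make the per-iteration overhead concrete, here is a rough sketch (hypothetical 
class names, not Mahout's actual API) of a driver that launches one Hadoop job per 
gradient-descent iteration; the job start-up cost is paid every single iteration:

// Hypothetical sketch only: one MapReduce job per training iteration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeTrainingDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    int numIterations = Integer.parseInt(args[1]);

    for (int i = 0; i < numIterations; i++) {
      // Each iteration is a separate job: JVM spin-up, task scheduling,
      // and re-reading the weights written by the previous iteration.
      Job job = Job.getInstance(conf, "nn-training-iteration-" + i);
      job.setJarByClass(IterativeTrainingDriver.class);
      // A mapper would compute partial gradients over its split and a reducer
      // would aggregate them into updated weights (classes omitted here).
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, new Path("weights/iteration-" + (i + 1)));
      if (!job.waitForCompletion(true)) {
        throw new IllegalStateException("Iteration " + i + " failed");
      }
    }
  }
}

In a BSP framework the workers stay alive across supersteps and only synchronize 
the model at the barrier, so this start-up cost is paid once instead of once per 
iteration.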

2. If the model is large.
If the model is large, we need to partition the model and store it in a 
distributed fashion; a solution can be found in a related NIPS paper 
(http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/large_deep_networks_nips2012.pdf).

In this case, the distributed system needs to be heterogeneous, since different 
nodes may have different tasks (parameter storage or computing). It is difficult 
to design an algorithm for such work in the MapReduce style, as every task is 
assumed to be homogeneous in MapReduce.
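The two roles could look roughly like the following sketch (illustrative 
interfaces only, not taken from the paper's code); expressing such long-lived, 
heterogeneous roles is awkward when every task must be an identical mapper or 
reducer:

// Rough sketch of the two heterogeneous roles in a parameter-server design.
import java.util.Map;

/** Holds one shard of the model parameters; runs on "storage" nodes. */
interface ParameterShard {
  double[] fetch(String layerId);                     // workers pull current weights
  void applyGradient(String layerId, double[] grad);  // workers push gradients
}

/** Runs on "computing" nodes; trains on its own slice of the data. */
interface ModelReplica {
  void trainMiniBatch(Iterable<double[]> miniBatch, Map<String, ParameterShard> shards);
}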

Actually, according to the Tera-scale deep learning talk 
(http://static.googleusercontent.com/media/research.google.com/en/us/archive/unsupervised_learning_talk_2012.pdf),
 even BSP is not quite suitable, since failures can happen at any time in a 
large-scale distributed system. In their implementation, they built an 
asynchronous computing framework to conduct the large-scale learning.
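The asynchronous style means each replica pushes gradients and pulls weights on 
its own schedule, so a slow or failed replica does not block the others. A 
single-process analogy (my own hypothetical sketch, not their distributed 
implementation) of applying possibly stale gradients without a global barrier:

// Illustrative only: lock-free parameter updates, no barrier between workers.
import java.util.concurrent.atomic.AtomicLongArray;

class AsyncParameterStore {
  // Weights stored as raw long bits so updates need no global lock.
  private final AtomicLongArray weights;

  AsyncParameterStore(int size) {
    weights = new AtomicLongArray(size);
  }

  double get(int i) {
    return Double.longBitsToDouble(weights.get(i));
  }

  // Any replica may apply a (possibly stale) gradient at any time.
  void applyGradient(int i, double grad, double learningRate) {
    long oldBits, newBits;
    do {
      oldBits = weights.get(i);
      double updated = Double.longBitsToDouble(oldBits) - learningRate * grad;
      newBits = Double.doubleToLongBits(updated);
    } while (!weights.compareAndSet(i, oldBits, newBits));
  }
}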

In summary, implementing a MapReduce version of NeuralNetwork is OK, but compared 
with more suitable computing frameworks, it is not very efficient.




> GSOC 2013 Neural network algorithms
> -----------------------------------
>
>                 Key: MAHOUT-1426
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1426
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: Maciej Mazur
>
> I would like to ask about possibilities of implementing neural network 
> algorithms in Mahout during GSOC.
> There is a classifier.mlp package with neural network.
> I can see neither RBM nor Autoencoder in these classes.
> There is only one word about Autoencoders in NeuralNetwork class.
> As far as I know Mahout doesn't support convolutional networks.
> Is it a good idea to implement one of these algorithms?
> Is it a reasonable amount of work?
> How hard is it to get GSOC in Mahout?
> Did anyone succeed last year?


