I really think that a true Downpour architecture is actually easier than what you suggest and much better suited to the purpose.
On Wed, Mar 19, 2014 at 1:28 PM, Maciej Mazur <[email protected]> wrote:

> Any comments? I think it will work. If I run one long-lasting job, hack
> the file system from the mapper in order to repeatedly update the
> weights, perform mini-batch GD, and store the updates in some folder,
> then in the background I could call small jobs for gathering gradients
> and updating weights.
>
> On Tue, Mar 18, 2014 at 10:11 PM, Maciej Mazur <[email protected]> wrote:
>
>> I'll say what I think about it.
>>
>> I know that Mahout is currently heading in a different direction. You
>> are working on refactoring, improving the existing API and migrating to
>> Spark. I know that there is a great deal of work to do there. I would
>> also like to help with that.
>>
>> I am impressed by the results achieved with neural networks. Generally
>> speaking, I think NNs give a significant advantage over other methods in
>> a wide range of problems and beat other state-of-the-art algorithms in
>> various areas. I think that in the future this family of algorithms will
>> play an even greater role. That's why I came up with the idea of
>> implementing neural networks.
>>
>> When it comes to functionality: pretraining (RBM), training
>> (SGD/mini-batch gradient descent + backpropagation + momentum) and
>> classification.
>>
>> Unfortunately, MapReduce is ill-suited to NNs. The biggest problem is
>> how to reduce the number of iterations. It is possible to divide the
>> data and use momentum applied to the edges - it helps a little, but
>> doesn't solve the problem.
>>
>> I have an idea for a not-exactly-MapReduce implementation, but I am not
>> sure whether it is possible on this infrastructure; it is certainly not
>> plain MapReduce. Other distributed NN implementations use asynchronous
>> operations. Is it possible to take advantage of asynchrony? First I
>> would partition the data, some subset on every node. On each node I
>> would use a number of files (directories) for storing weights. Each
>> machine would use these files to compute the cost function and update
>> the gradient. In the background, multiple reduce jobs would average the
>> gradients for some subset of the weights (one file each) and then
>> asynchronously update that subset of weights. In a way this idea is
>> similar to Downpour SGD from
>> http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/large_deep_networks_nips2012.pdf
>>
>> There are a couple of problems here. Is it a feasible solution?
>>
>> A parallel implementation is very complex. It's hard to design something
>> that uses MapReduce but is not a MapReduce algorithm. You are definitely
>> more experienced than I am and I'll need a lot of help; I may not be
>> aware of some limitations.
>>
>> From my perspective it would be a great experience, even if I end up
>> doing something other than NNs. Frankly speaking, I think I'll stay here
>> regardless of whether my proposal is accepted. It will be a great
>> opportunity to learn.
>>
>> On Mon, Mar 17, 2014 at 5:27 AM, Suneel Marthi <[email protected]> wrote:
>>
>>> I would suggest looking at deeplearning4j.org (they went public very
>>> recently) to see how they utilized Iterative Reduce for implementing
>>> neural nets.
>>>
>>> I'm not sure, given the present state of flux on the project, that we
>>> should even be considering adding any new algorithms. The existing ones
>>> can be refactored to be more API-driven (for both clustering and
>>> classification); that's no trivial effort and could definitely use a
>>> lot of help.
>>>
>>> How is what you are proposing going to be any better than the similar
>>> implementations Mahout already has, in terms of functionality,
>>> performance and scaling? Are there users who would prefer what you are
>>> proposing over what already exists in Mahout?
>>>
>>> We did purge a lot of the unmaintained and non-functional code for the
>>> 0.9 release and are down to where we are today. There's still room for
>>> improvement in what presently exists, and the project could definitely
>>> use some help there.
>>>
>>> With the emphasis now on supporting Spark ASAP, any new implementations
>>> would not make that task any easier. There's still stuff in Mahout Math
>>> that can be redone to be more flexible, like the present NamedVector
>>> (see MAHOUT-1236). That's a very high priority for the next release and
>>> is going to impact existing implementations once finalized. The present
>>> codebase is very heavily dependent on M/R; decoupling the relevant
>>> pieces from the MR API and offering a potential Mahout user a choice of
>>> execution engines (Spark or MR) is no trivial task.
>>>
>>> IMO, the emphasis should now be on stabilizing, refactoring and
>>> cleaning up the existing implementations (the technical debt that's
>>> building up) and porting things to Spark.
>>>
>>> On Sunday, March 16, 2014 4:39 PM, Ted Dunning <[email protected]> wrote:
>>>
>>> OK. I am confused now as well.
>>>
>>> Even so, I would recommend that you propose a non-map-reduce but still
>>> parallel version.
>>>
>>> Some of the confusion may stem from the fact that you can design
>>> non-map-reduce programs to run in such a way that a map-reduce
>>> execution framework like Hadoop thinks they are doing map-reduce. In
>>> reality these programs do whatever they like and just pretend to be
>>> map-reduce programs in order to get a bunch of processes launched.
>>>
>>> On Sun, Mar 16, 2014 at 1:27 PM, Maciej Mazur <[email protected]> wrote:
>>>
>>>> I have one final question.
>>>>
>>>> I have mixed feelings about this discussion.
>>>> You are saying that there is no point in doing a MapReduce
>>>> implementation of neural networks (with pretraining).
>>>> Then you say that a non-map-reduce version would be of substantial
>>>> interest.
>>>> On the other hand you say that it would be easy and that it defeats
>>>> the purpose of doing it in Mahout (because it is not an MR version).
>>>> Finally you say that building something simple and working is a good
>>>> thing.
>>>>
>>>> I do not really know what to make of this.
>>>> Could you give me some advice on whether I should write a proposal or
>>>> not? (And if I should: should I propose a MapReduce or a non-MapReduce
>>>> version? There is already an NN algorithm, but without pretraining.)
>>>>
>>>> Thanks,
>>>> Maciej Mazur
>>>>
>>>> On Fri, Feb 28, 2014 at 5:44 AM, peng <[email protected]> wrote:
>>>>
>>>>> Oh, thanks a lot, I missed that one :)
>>>>> +1 on easiest one implemented first. I hadn't thought about the
>>>>> difficulty issue; I need to read more about the YARN extension.
>>>>>
>>>>> Yours Peng
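A minimal single-machine sketch of the gradient-folder scheme Maciej
describes above: workers drop per-mini-batch gradient files for one weight
shard into a directory, and a background task averages the pending files and
applies the update. The class name, file layout and update rule below are
illustrative assumptions, not Mahout or Hadoop API.

    // Single-JVM approximation of the shared-folder idea; on a cluster the
    // directory would live on a shared file system instead of local disk.
    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    public class ShardedGradientSketch {

        // Worker side: after each mini-batch, drop the gradient for this
        // shard into the shard directory.
        static void writeGradient(Path shardDir, double[] gradient) throws IOException {
            Path tmp = shardDir.resolve(UUID.randomUUID() + ".tmp");
            try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(tmp))) {
                out.writeInt(gradient.length);
                for (double g : gradient) {
                    out.writeDouble(g);
                }
            }
            // Rename only after the file is complete, so the averager never
            // reads a partial write.
            Files.move(tmp, shardDir.resolve(UUID.randomUUID() + ".grad"),
                       StandardCopyOption.ATOMIC_MOVE);
        }

        // Background side: average all pending gradient files for this shard
        // and apply one update. Assumes every gradient has the same dimension
        // as the shard's weight vector.
        static void averageAndApply(Path shardDir, double[] weights, double learningRate)
                throws IOException {
            List<double[]> gradients = new ArrayList<>();
            try (DirectoryStream<Path> files = Files.newDirectoryStream(shardDir, "*.grad")) {
                for (Path f : files) {
                    gradients.add(readVector(f));
                    Files.delete(f);            // consume the gradient file
                }
            }
            if (gradients.isEmpty()) {
                return;
            }
            for (int i = 0; i < weights.length; i++) {
                double sum = 0.0;
                for (double[] g : gradients) {
                    sum += g[i];
                }
                weights[i] -= learningRate * sum / gradients.size();  // averaged step
            }
        }

        static double[] readVector(Path f) throws IOException {
            try (DataInputStream in = new DataInputStream(Files.newInputStream(f))) {
                double[] v = new double[in.readInt()];
                for (int i = 0; i < v.length; i++) {
                    v[i] = in.readDouble();
                }
                return v;
            }
        }
    }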
>>>>> On Thu 27 Feb 2014 08:06:27 PM EST, Yexi Jiang wrote:
>>>>>
>>>>>> Hi Peng,
>>>>>>
>>>>>> Do you mean the MultilayerPerceptron? There are three 'train'
>>>>>> methods, and only one (the one without the trackingKey and groupKey
>>>>>> parameters) is implemented. In the current implementation the others
>>>>>> are not used.
>>>>>>
>>>>>> Regards,
>>>>>> Yexi
>>>>>>
>>>>>> 2014-02-27 19:31 GMT-05:00 Ted Dunning <[email protected]>:
>>>>>>
>>>>>>> Generally, for training models like this there is an assumption that
>>>>>>> fault tolerance is not particularly necessary, because the low risk
>>>>>>> of failure trades against algorithmic speed. For a reasonably small
>>>>>>> chance of failure, simply re-running the training is just fine. If
>>>>>>> there is a high risk of failure, simply checkpointing the parameter
>>>>>>> server is sufficient to allow restarts without redundancy.
>>>>>>>
>>>>>>> Sharding the parameters is quite possible and is reasonable when the
>>>>>>> parameter vector exceeds tens or hundreds of millions of parameters,
>>>>>>> but isn't likely to be necessary below that.
>>>>>>>
>>>>>>> The asymmetry is similarly not a big deal. The traffic to and from
>>>>>>> the parameter server isn't enormous.
>>>>>>>
>>>>>>> Building something simple and working first is a good thing.
>>>>>>>
>>>>>>> On Thu, Feb 27, 2014 at 3:56 PM, peng <[email protected]> wrote:
>>>>>>>
>>>>>>>> With pleasure! The original Downpour paper proposes a parameter
>>>>>>>> server from which subnodes download shards of the old model and to
>>>>>>>> which they upload gradients. So if the parameter server is down,
>>>>>>>> the process has to be delayed. It also requires that all model
>>>>>>>> parameters be stored on, atomically updated on, and fetched from a
>>>>>>>> single machine, imposing asymmetric HDD and bandwidth requirements.
>>>>>>>> This design is necessary only because each -=delta operation has to
>>>>>>>> be atomic, which cannot be ensured across the network (e.g. on
>>>>>>>> HDFS).
>>>>>>>>
>>>>>>>> But that doesn't mean the operation cannot be decentralized:
>>>>>>>> parameters can be sharded across multiple nodes, and multiple
>>>>>>>> accumulator instances can handle parts of the vector subtraction.
>>>>>>>> This should be easy if you create a buffer for the stream of
>>>>>>>> gradients and allocate the proper numbers of producers and
>>>>>>>> consumers on each machine to make sure it doesn't overflow.
>>>>>>>> Obviously this is far from the MR framework, but at least it can be
>>>>>>>> made homogeneous and slightly faster (because sparse data can be
>>>>>>>> distributed in a way that minimizes overlap, so gradients don't
>>>>>>>> have to cross the network as frequently).
>>>>>>>>
>>>>>>>> If we instead use a centralized architecture, then there must
>>>>>>>> be >=1 backup parameter server for mission-critical training.
>>>>>>>>
>>>>>>>> Yours Peng
>>>>>>>>
>>>>>>>> (e.g. we can simply use a producer/consumer pattern for all
>>>>>>>> gradients)
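A rough single-JVM sketch of the decentralized shard peng describes:
producers enqueue gradient slices for one shard into a bounded buffer, a
single consumer thread applies the -=delta updates (so they stay atomic per
shard), and the shard is checkpointed periodically so a failed node can
simply restart from its last checkpoint, as Ted suggests. The ParameterShard
class and its methods are hypothetical, not existing Mahout code.

    import java.io.*;
    import java.nio.file.*;
    import java.util.concurrent.*;

    public class ParameterShard implements Runnable {

        private final double[] weights;                   // this node's slice of the model
        private final BlockingQueue<double[]> gradients =
                new ArrayBlockingQueue<>(10_000);         // bounded buffer: producers block, never overflow
        private final double learningRate;
        private final Path checkpointFile;
        private long updatesSinceCheckpoint = 0;

        public ParameterShard(int size, double learningRate, Path checkpointFile) {
            this.weights = new double[size];
            this.learningRate = learningRate;
            this.checkpointFile = checkpointFile;
        }

        // Producer side: a worker pushes the gradient slice that belongs to this shard.
        public void push(double[] gradientSlice) throws InterruptedException {
            gradients.put(gradientSlice);
        }

        // Workers read (possibly slightly stale) weights without coordination.
        public synchronized double[] snapshot() {
            return weights.clone();
        }

        // Consumer side: apply updates one at a time; checkpoint now and then.
        @Override
        public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    double[] g = gradients.take();
                    synchronized (this) {
                        for (int i = 0; i < weights.length; i++) {
                            weights[i] -= learningRate * g[i];
                        }
                    }
                    if (++updatesSinceCheckpoint >= 1000) {
                        checkpoint();
                        updatesSinceCheckpoint = 0;
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        private void checkpoint() throws IOException {
            try (ObjectOutputStream out =
                     new ObjectOutputStream(Files.newOutputStream(checkpointFile))) {
                out.writeObject(snapshot());
            }
        }
    }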
>>>>>>>> On Thu 27 Feb 2014 05:09:52 PM EST, Yexi Jiang wrote:
>>>>>>>>
>>>>>>>>> Peng,
>>>>>>>>>
>>>>>>>>> Can you provide more details about your thought?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> 2014-02-27 16:00 GMT-05:00 peng <[email protected]>:
>>>>>>>>>
>>>>>>>>>> That should be easy, but it defeats the purpose of using Mahout,
>>>>>>>>>> as there are already enough implementations of single-node
>>>>>>>>>> backpropagation (in which case a GPU is much faster).
>>>>>>>>>>
>>>>>>>>>> Yexi:
>>>>>>>>>>
>>>>>>>>>> Regarding Downpour SGD and Sandblaster, may I suggest that the
>>>>>>>>>> implementation would be better off without a parameter server?
>>>>>>>>>> It is obviously a single point of failure and, in terms of
>>>>>>>>>> bandwidth, a bottleneck. I heard that MLlib on top of Spark has a
>>>>>>>>>> functional implementation (I have never read or tested it), and
>>>>>>>>>> it's possible to build the workflow on top of YARN. None of those
>>>>>>>>>> frameworks has a heterogeneous topology.
>>>>>>>>>>
>>>>>>>>>> Yours Peng
>>>>>>>>>>
>>>>>>>>>> On Thu 27 Feb 2014 09:43:19 AM EST, Maciej Mazur (JIRA) wrote:
>>>>>>>>>>
>>>>>>>>>>> [ https://issues.apache.org/jira/browse/MAHOUT-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913488#comment-13913488 ]
>>>>>>>>>>>
>>>>>>>>>>> Maciej Mazur edited comment on MAHOUT-1426 at 2/27/14 2:41 PM:
>>>>>>>>>>> ---------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> I've read the papers. I wasn't thinking about a distributed
>>>>>>>>>>> network; I had in mind a network that fits into memory but
>>>>>>>>>>> requires a significant amount of computation.
>>>>>>>>>>>
>>>>>>>>>>> I understand that there are better options for neural networks
>>>>>>>>>>> than map reduce. How about a non-map-reduce version? I see that
>>>>>>>>>>> you think it is something that would make sense. (Doing a
>>>>>>>>>>> non-map-reduce neural network in Mahout would be of substantial
>>>>>>>>>>> interest.) Do you think it would be a valuable contribution? Is
>>>>>>>>>>> there a need for this type of algorithm? I am thinking about
>>>>>>>>>>> multi-threaded batch gradient descent with pretraining (RBM
>>>>>>>>>>> and/or autoencoders).
>>>>>>>>>>>
>>>>>>>>>>> I have looked into the old JIRAs. The RBM patch was withdrawn:
>>>>>>>>>>> "I would rather like to withdraw that patch, because by the time
>>>>>>>>>>> i implemented it i didn't know that the learning algorithm is
>>>>>>>>>>> not suited for MR, so I think there is no point including the
>>>>>>>>>>> patch."
>>>>>>>>>>>
>>>>>>>>>>>> GSOC 2013 Neural network algorithms
>>>>>>>>>>>> -----------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>>                 Key: MAHOUT-1426
>>>>>>>>>>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1426
>>>>>>>>>>>>             Project: Mahout
>>>>>>>>>>>>          Issue Type: Improvement
>>>>>>>>>>>>          Components: Classification
>>>>>>>>>>>>            Reporter: Maciej Mazur
>>>>>>>>>>>>
>>>>>>>>>>>> I would like to ask about the possibilities of implementing
>>>>>>>>>>>> neural network algorithms in Mahout during GSoC.
>>>>>>>>>>>> There is a classifier.mlp package with a neural network.
>>>>>>>>>>>> I can see neither RBM nor Autoencoder in these classes.
>>>>>>>>>>>> There is only one mention of Autoencoders, in the NeuralNetwork
>>>>>>>>>>>> class.
>>>>>>>>>>>> As far as I know, Mahout doesn't support convolutional
>>>>>>>>>>>> networks.
>>>>>>>>>>>> Is it a good idea to implement one of these algorithms?
>>>>>>>>>>>> Is it a reasonable amount of work?
>>>>>>>>>>>> How hard is it to get into GSoC with Mahout?
>>>>>>>>>>>> Did anyone succeed last year?
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> This message was sent by Atlassian JIRA
>>>>>>>>>>> (v6.1.5#6160)
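For reference, a bare-bones sketch of the multi-threaded batch gradient
descent mentioned in the JIRA comment above: each thread computes the
gradient over its slice of the in-memory data, the partial gradients are
summed, and one synchronous update is applied per iteration. The
SliceGradient interface is a hypothetical placeholder for the actual
backpropagation code; RBM/autoencoder pretraining is not shown.

    import java.util.*;
    import java.util.concurrent.*;

    public class MultiThreadedBatchGd {

        // Computes the gradient of the loss over one slice of the training
        // data at the given weights. A real implementation would run
        // backpropagation over examples [from, to) here.
        interface SliceGradient {
            double[] compute(double[] weights, int from, int to);
        }

        public static void train(double[] weights, int numExamples, int iterations,
                                 double learningRate, int threads, SliceGradient grad)
                throws InterruptedException, ExecutionException {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            try {
                int sliceSize = (numExamples + threads - 1) / threads;
                for (int iter = 0; iter < iterations; iter++) {
                    List<Callable<double[]>> tasks = new ArrayList<>();
                    for (int t = 0; t < threads; t++) {
                        final int from = Math.min(numExamples, t * sliceSize);
                        final int to = Math.min(numExamples, from + sliceSize);
                        tasks.add(() -> grad.compute(weights, from, to));
                    }
                    // Synchronous barrier: wait for every partial gradient.
                    double[] total = new double[weights.length];
                    for (Future<double[]> f : pool.invokeAll(tasks)) {
                        double[] partial = f.get();
                        for (int i = 0; i < total.length; i++) {
                            total[i] += partial[i];
                        }
                    }
                    // One full-batch update per iteration.
                    for (int i = 0; i < weights.length; i++) {
                        weights[i] -= learningRate * total[i] / numExamples;
                    }
                }
            } finally {
                pool.shutdown();
            }
        }
    }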
