Ok, I think you are right. Although it would be a valuable experience, I will have to leave it. Thanks for your feedback. I understand that this is not the best use of map reduce, so I'm not going to propose this project. This issue can now be closed.
On Wed, Mar 19, 2014 at 11:01 PM, Ted Dunning <[email protected]> wrote:

> I really think that a true downpour architecture is actually easier than
> what you suggest and much better for the purpose.

On Wed, Mar 19, 2014 at 1:28 PM, Maciej Mazur <[email protected]> wrote:

> Any comments? I think it will work. If I do one long-lasting job, hack the
> file system from the mapper in order to repeatedly update weights, perform
> mini-batch GD, and store updates in some folder, then in the background I
> could call small jobs for gathering gradients and updating weights.

On Tue, Mar 18, 2014 at 10:11 PM, Maciej Mazur <[email protected]> wrote:

> I'll say what I think about it.
>
> I know that Mahout is currently heading in a different direction. You are
> working on refactoring, improving the existing API and migrating to Spark.
> I know that there is a great deal of work to do there. I would also like
> to help with that.
>
> I am impressed by the results achieved with neural networks. Generally
> speaking, I think that NNs give a significant advantage over other methods
> in a wide range of problems; they beat other state-of-the-art algorithms
> in various areas, and I think that in the future they will play an even
> greater role. That's why I came up with the idea to implement neural
> networks.
>
> When it comes to functionality: pretraining (RBM), training (SGD/mini-batch
> gradient descent + backpropagation + momentum) and classification.
>
> Unfortunately MapReduce is ill-suited for NNs. The biggest problem is how
> to reduce the number of iterations. It is possible to divide the data and
> use momentum applied to edges - it helps a little, but doesn't solve the
> problem.
>
> I have an idea for a not-exactly-MapReduce implementation, but I am not
> sure whether it is possible on this infrastructure; for sure it is not
> plain map reduce. Other distributed NN implementations use asynchronous
> operations. Is it possible to take advantage of asynchrony? First I would
> partition the data, some subset on every node. On each node I would use a
> number of files (directories) for storing weights. Each machine would use
> these files to compute the cost function and the gradient. In the
> background, multiple reduce jobs would average the gradients for some
> subset of the weights (one file) and then asynchronously update that
> subset of weights (from one file). In a way this idea is similar to
> Downpour SGD from
> http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/large_deep_networks_nips2012.pdf
>
> There are a couple of problems here. Is it a feasible solution?
>
> A parallel implementation is very complex. It's hard to design something
> that uses MapReduce but is not a MapReduce algorithm. You are definitely
> more experienced than me and I'll need a lot of help; I may not be aware
> of some limitations.
>
> From my perspective it would be a great experience, even if I ended up
> doing something other than NNs. Frankly speaking, I think I'll stay here
> regardless of whether my proposal is accepted. It'll be a great
> opportunity to learn.
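For reference, the training step listed above (mini-batch gradient descent with
backpropagation and momentum) reduces to a very small update rule. A minimal
sketch in plain Java, where MomentumUpdate and its double[] arguments are
illustrative only and not tied to any existing Mahout class:

    // Minimal sketch (not Mahout API) of one mini-batch update with momentum.
    // 'weights', 'gradient' and 'velocity' are plain arrays purely for illustration.
    public final class MomentumUpdate {

      /** Applies one mini-batch gradient step; 'velocity' carries momentum between calls. */
      public static void step(double[] weights, double[] gradient, double[] velocity,
                              double learningRate, double momentum) {
        for (int i = 0; i < weights.length; i++) {
          // decaying sum of past gradients, then move the weights along it
          velocity[i] = momentum * velocity[i] - learningRate * gradient[i];
          weights[i] += velocity[i];
        }
      }
    }

Each worker would call step() once per mini-batch with the gradient produced by
backpropagation; the velocity array carries the momentum term between calls.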
On Mon, Mar 17, 2014 at 5:27 AM, Suneel Marthi <[email protected]> wrote:

> I would suggest looking at deeplearning4j.org (they went public very
> recently) and seeing how they utilized Iterative Reduce for implementing
> neural nets.
>
> Given the present state of flux on the project, I'm not sure we should
> even be considering adding any new algorithms. The existing ones can be
> refactored to be more API-driven (for both clustering and classification);
> that's no trivial effort and could definitely use a lot of help.
>
> How is what you are proposing going to be any better than the similar
> implementations Mahout already has, in terms of functionality,
> performance and scaling? Are there users who would prefer what you are
> proposing over what already exists in Mahout?
>
> We did purge a lot of the unmaintained and non-functional code for the
> 0.9 release and are down to where we are today. There's still room for
> improvement in what presently exists and the project could definitely use
> some help there.
>
> With the emphasis now on supporting Spark ASAP, any new implementations
> would not make the task any easier. There's still stuff in Mahout Math
> that can be redone to be more flexible, like the present NamedVector (see
> MAHOUT-1236). That's a very high priority for the next release and is
> going to impact existing implementations once finalized. The present
> codebase is very heavily dependent on M/R; decoupling the relevant pieces
> from the MR API and being able to offer a potential Mahout user the choice
> of different execution engines (Spark or MR) is no trivial task.
>
> IMO, the emphasis should now be more on stabilizing, refactoring and
> cleaning up the existing implementations (technical debt that's building
> up) and porting stuff to Spark.

On Sunday, March 16, 2014 4:39 PM, Ted Dunning <[email protected]> wrote:

> OK. I am confused now as well.
>
> Even so, I would recommend that you propose a non-map-reduce but still
> parallel version.
>
> Some of the confusion may stem from the fact that you can design
> non-map-reduce programs to run in such a way that a map-reduce execution
> framework like Hadoop thinks that they are doing map-reduce. Instead,
> these programs are doing whatever they feel like and just pretending to
> be map-reduce programs in order to get a bunch of processes launched.

On Sun, Mar 16, 2014 at 1:27 PM, Maciej Mazur <[email protected]> wrote:

> I have one final question.
>
> I have mixed feelings about this discussion. You are saying that there is
> no point in doing a MapReduce implementation of neural networks (with
> pretraining). Then you say that a non-map-reduce version would be of
> substantial interest. On the other hand, you say that it would be easy and
> that it defeats the purpose of doing it in Mahout (because it would not be
> an MR version). Finally, you say that building something simple and
> working is a good thing.
>
> I do not really know what to think about it. Could you give me some advice
> on whether I should write a proposal or not? (And if I should: should I
> propose a MapReduce or a non-MapReduce version? There is already an NN
> algorithm, but without pretraining.)
>
> Thanks,
> Maciej Mazur

On Fri, Feb 28, 2014 at 5:44 AM, peng <[email protected]> wrote:

> Oh, thanks a lot, I missed that one :)
> +1 on the easiest one being implemented first. I hadn't thought about the
> difficulty issue; I need to read more about the YARN extension.
>
> Yours Peng

On Thu 27 Feb 2014 08:06:27 PM EST, Yexi Jiang wrote:

> Hi, Peng,
>
> Do you mean the MultilayerPerceptron? There are three 'train' methods, and
> only one (the one without the trackingKey and groupKey parameters) is
> implemented. In the current implementation the others are not used.
>
> Regards,
> Yexi

2014-02-27 19:31 GMT-05:00 Ted Dunning <[email protected]>:

> Generally for training models like this, there is an assumption that
> fault tolerance is not particularly necessary, because the low risk of
> failure trades against algorithmic speed. For a reasonably small chance
> of failure, simply re-running the training is just fine. If there is a
> high risk of failure, simply checkpointing the parameter server is
> sufficient to allow restarts without redundancy.
>
> Sharding the parameters is quite possible and is reasonable when the
> parameter vector exceeds tens or hundreds of millions of parameters, but
> isn't likely to be necessary below that.
>
> The asymmetry is similarly not a big deal. The traffic to and from the
> parameter server isn't enormous.
>
> Building something simple and working first is a good thing.
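To make the downpour-style architecture discussed in this thread concrete,
here is a hedged sketch. ParameterServer, GradientSource and DownpourWorker
are hypothetical names invented for illustration, not Mahout or DistBelief
APIs: each worker pulls possibly stale weights for a shard, computes a
gradient on its local mini-batch, and pushes the delta back with no global
barrier; checkpoint() is the periodic snapshot that makes restarts possible
without redundancy.

    // Hypothetical sketch of a downpour-style training loop (not Mahout code).
    interface ParameterServer {
      double[] fetch(int shardId);               // current weights of one shard (may be slightly stale)
      void push(int shardId, double[] gradient); // asynchronous accumulate, no global barrier
      void checkpoint();                         // periodic snapshot, enough to restart after a failure
    }

    interface GradientSource {
      // stands in for backpropagation over the node's local mini-batch
      double[] nextGradient(double[] weights);
    }

    final class DownpourWorker implements Runnable {
      private final ParameterServer server;
      private final GradientSource data;
      private final int shardId;

      DownpourWorker(ParameterServer server, GradientSource data, int shardId) {
        this.server = server;
        this.data = data;
        this.shardId = shardId;
      }

      @Override
      public void run() {
        while (!Thread.currentThread().isInterrupted()) {
          double[] weights = server.fetch(shardId);        // pull
          double[] gradient = data.nextGradient(weights);  // compute locally
          server.push(shardId, gradient);                  // push and keep going
        }
      }
    }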
On Thu, Feb 27, 2014 at 3:56 PM, peng <[email protected]> wrote:

> With pleasure! The original downpour paper proposes a parameter server
> from which subnodes download shards of the old model and to which they
> upload gradients. So if the parameter server is down, the process has to
> be delayed. It also requires that all model parameters be stored and
> atomically updated on (and fetched from) a single machine, imposing an
> asymmetric HDD and bandwidth requirement. This design is necessary only
> because each -=delta operation has to be atomic, which cannot be ensured
> across the network (e.g. on HDFS).
>
> But it doesn't mean that the operation cannot be decentralized: parameters
> can be sharded across multiple nodes, and multiple accumulator instances
> can handle parts of the vector subtraction - e.g. we can simply use a
> producer/consumer pattern for all gradients. This should be easy if you
> create a buffer for the stream of gradients and allocate proper numbers of
> producers and consumers on each machine to make sure it doesn't overflow.
> Obviously this is far from the MR framework, but at least it can be made
> homogeneous and slightly faster (because sparse data can be distributed in
> a way that minimizes overlap, so gradients don't have to go across the
> network that frequently).
>
> If we instead use a centralized architecture, then there must be >=1
> backup parameter server for mission-critical training.
>
> Yours Peng
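Peng's buffered producer/consumer idea could look roughly like the following
sketch (ShardAccumulator is a hypothetical class written for this thread, not
Mahout code): producers enqueue gradient deltas for the shard a node owns into
a bounded queue, and a single consumer thread drains it and applies the
-=delta step serially, so no cross-network atomicity is required.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Sketch only: one accumulator per parameter shard, fed by gradient producers.
    final class ShardAccumulator implements Runnable {
      private final double[] shard;  // the slice of the parameter vector this node owns
      private final BlockingQueue<double[]> buffer = new ArrayBlockingQueue<>(1024);

      ShardAccumulator(int shardSize) {
        this.shard = new double[shardSize];
      }

      /** Called by gradient producers; blocks when the buffer is full, so it cannot overflow. */
      void offer(double[] delta) throws InterruptedException {
        buffer.put(delta);
      }

      @Override
      public void run() {
        try {
          while (true) {
            double[] delta = buffer.take();   // wait for the next gradient
            for (int i = 0; i < shard.length; i++) {
              shard[i] -= delta[i];           // -=delta applied by the single owning thread
            }
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // stop cleanly when asked
        }
      }
    }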
On Thu 27 Feb 2014 05:09:52 PM EST, Yexi Jiang wrote:

> Peng,
>
> Can you provide more details about your thought?
>
> Regards,

2014-02-27 16:00 GMT-05:00 peng <[email protected]>:

> That should be easy. But that defeats the purpose of using Mahout, as
> there are already enough implementations of single-node backpropagation
> (in which case a GPU is much faster).
>
> Yexi:
>
> Regarding downpour SGD and sandblaster, may I suggest that the
> implementation had better have no parameter server? It's obviously a
> single point of failure and, in terms of bandwidth, a bottleneck. I heard
> that MLlib on top of Spark has a functional implementation (I have never
> read or tested it), and it's possible to build the workflow on top of
> YARN. None of those frameworks has a heterogeneous topology.
>
> Yours Peng

On Thu 27 Feb 2014 09:43:19 AM EST, Maciej Mazur (JIRA) wrote:

> [ https://issues.apache.org/jira/browse/MAHOUT-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913488#comment-13913488 ]
>
> Maciej Mazur edited comment on MAHOUT-1426 at 2/27/14 2:41 PM:
> ---------------------------------------------------------------
>
> I've read the papers. I didn't think about a distributed network. I had in
> mind a network that will fit into memory, but will require a significant
> amount of computation.
>
> I understand that there are better options for neural networks than map
> reduce. How about a non-map-reduce version? I see that you think it is
> something that would make sense. ("Doing a non-map-reduce neural network
> in Mahout would be of substantial interest.") Do you think it would be a
> valuable contribution? Is there a need for this type of algorithm? I am
> thinking about multi-threaded batch gradient descent with pretraining
> (RBM and/or Autoencoders).
>
> I have looked into these old JIRAs. The RBM patch was withdrawn: "I would
> rather like to withdraw that patch, because by the time I implemented it I
> didn't know that the learning algorithm is not suited for MR, so I think
> there is no point including the patch."
>
> > GSOC 2013 Neural network algorithms
> > -----------------------------------
> >
> >                 Key: MAHOUT-1426
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-1426
> >             Project: Mahout
> >          Issue Type: Improvement
> >          Components: Classification
> >            Reporter: Maciej Mazur
> >
> > I would like to ask about the possibilities of implementing neural
> > network algorithms in Mahout during GSOC.
> > There is a classifier.mlp package with a neural network, but I can't see
> > either RBM or Autoencoder in these classes; there is only one mention of
> > Autoencoders in the NeuralNetwork class. As far as I know, Mahout doesn't
> > support convolutional networks.
> > Is it a good idea to implement one of these algorithms? Is it a
> > reasonable amount of work? How hard is it to get a GSOC slot in Mahout?
> > Did anyone succeed last year?
>
> --
> This message was sent by Atlassian JIRA
> (v6.1.5#6160)
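For completeness, the "multi-threaded batch gradient descent" mentioned in the
JIRA comment above could be sketched with a plain thread pool (illustrative
only, not Mahout code): the batch is split into slices, each slice's gradient
is computed by a separate task, and the partial gradients are summed before a
single weight update.

    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Sketch only: each Callable stands in for backpropagation over one slice
    // of the batch; the partial gradients are summed into a full-batch gradient.
    final class ParallelBatchGradient {

      static double[] compute(List<Callable<double[]>> sliceTasks, int dim, int threads)
          throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
          double[] total = new double[dim];
          for (Future<double[]> partial : pool.invokeAll(sliceTasks)) {
            double[] g = partial.get();       // gradient of one slice of the batch
            for (int i = 0; i < dim; i++) {
              total[i] += g[i];               // accumulate into the full-batch gradient
            }
          }
          return total;
        } finally {
          pool.shutdown();
        }
      }
    }

The returned gradient would then feed a single update such as the momentum
step sketched earlier in this thread.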
