I really think that a true Downpour architecture is actually easier than what you suggest and much better suited to the purpose.
On Wed, Mar 19, 2014 at 1:28 PM, Maciej Mazur <[email protected]> wrote:

> Any comments? I think it will work. If I run one long-lasting job, hack
> the file system from the mapper in order to repeatedly update the
> weights, perform mini-batch GD, and store the updates in some folder,
> then in the background I could call small jobs for gathering gradients
> and updating weights.
>
> On Tue, Mar 18, 2014 at 10:11 PM, Maciej Mazur <[email protected]> wrote:
>
>> I'll say what I think about it.
>>
>> I know that Mahout is currently heading in a different direction. You
>> are working on refactoring, improving the existing API and migrating to
>> Spark. I know that there is a great deal of work to do there. I would
>> also like to help with that.
>>
>> I am impressed by the results achieved with neural networks. Generally
>> speaking, I think NNs give a significant advantage over other methods in
>> a wide range of problems and beat other state-of-the-art algorithms in
>> various areas. I think that in the future this family of algorithms will
>> play an even greater role. That's why I came up with the idea of
>> implementing neural networks.
>>
>> When it comes to functionality: pretraining (RBM), training
>> (SGD/mini-batch gradient descent + backpropagation + momentum) and
>> classification.
>>
>> Unfortunately, MapReduce is ill-suited to NNs. The biggest problem is
>> how to reduce the number of iterations. It is possible to divide the
>> data and use momentum applied to the edges - it helps a little, but
>> doesn't solve the problem.
>>
>> I have an idea for a not-exactly-MapReduce implementation, but I am not
>> sure whether it is possible on this infrastructure; it is certainly not
>> plain MapReduce. Other distributed NN implementations use asynchronous
>> operations. Is it possible to take advantage of asynchrony? First I
>> would partition the data, some subset on every node. On each node I
>> would use a number of files (directories) for storing weights. Each
>> machine would use these files to compute the cost function and update
>> the gradient. In the background, multiple reduce jobs would average the
>> gradients for some subset of the weights (one file each) and then
>> asynchronously update that subset of weights. In a way this idea is
>> similar to Downpour SGD from
>> http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/large_deep_networks_nips2012.pdf
>>
>> There are a couple of problems here. Is it a feasible solution?
>>
>> A parallel implementation is very complex. It's hard to design something
>> that uses MapReduce but is not a MapReduce algorithm. You are definitely
>> more experienced than I am and I'll need a lot of help; I may not be
>> aware of some limitations.
>>
>> From my perspective it would be a great experience, even if I end up
>> doing something other than NNs. Frankly speaking, I think I'll stay here
>> regardless of whether my proposal is accepted. It will be a great
>> opportunity to learn.
>>
>> On Mon, Mar 17, 2014 at 5:27 AM, Suneel Marthi <[email protected]> wrote:
>>
>>> I would suggest looking at deeplearning4j.org (they went public very
>>> recently) to see how they utilized Iterative Reduce for implementing
>>> neural nets.
>>>
>>> I'm not sure, given the present state of flux on the project, that we
>>> should even be considering adding any new algorithms. The existing ones
>>> can be refactored to be more API-driven (for both clustering and
>>> classification); that's no trivial effort and could definitely use a
>>> lot of help.
>>>
>>> How is what you are proposing going to be any better than the similar
>>> implementations Mahout already has, in terms of functionality,
>>> performance and scaling? Are there users who would prefer what you are
>>> proposing over what already exists in Mahout?
>>>
>>> We did purge a lot of the unmaintained and non-functional code for the
>>> 0.9 release and are down to where we are today. There's still room for
>>> improvement in what presently exists, and the project could definitely
>>> use some help there.
>>>
>>> With the emphasis now on supporting Spark ASAP, any new implementations
>>> would not make that task any easier. There's still stuff in Mahout Math
>>> that can be redone to be more flexible, like the present NamedVector
>>> (see MAHOUT-1236). That's a very high priority for the next release and
>>> is going to impact existing implementations once finalized. The present
>>> codebase is very heavily dependent on M/R; decoupling the relevant
>>> pieces from the MR API and offering a potential Mahout user a choice of
>>> execution engines (Spark or MR) is no trivial task.
>>>
>>> IMO, the emphasis should now be on stabilizing, refactoring and
>>> cleaning up the existing implementations (the technical debt that's
>>> building up) and porting things to Spark.
>>>
>>> On Sunday, March 16, 2014 4:39 PM, Ted Dunning <[email protected]> wrote:
>>>
>>> OK. I am confused now as well.
>>>
>>> Even so, I would recommend that you propose a non-map-reduce but still
>>> parallel version.
>>>
>>> Some of the confusion may stem from the fact that you can design
>>> non-map-reduce programs to run in such a way that a map-reduce
>>> execution framework like Hadoop thinks they are doing map-reduce. In
>>> reality these programs do whatever they like and just pretend to be
>>> map-reduce programs in order to get a bunch of processes launched.
>>>
>>> On Sun, Mar 16, 2014 at 1:27 PM, Maciej Mazur <[email protected]> wrote:
>>>
>>>> I have one final question.
>>>>
>>>> I have mixed feelings about this discussion.
>>>> You are saying that there is no point in doing a MapReduce
>>>> implementation of neural networks (with pretraining).
>>>> Then you say that a non-map-reduce version would be of substantial
>>>> interest.
>>>> On the other hand you say that it would be easy and that it defeats
>>>> the purpose of doing it in Mahout (because it is not an MR version).
>>>> Finally you say that building something simple and working is a good
>>>> thing.
>>>>
>>>> I do not really know what to make of this.
>>>> Could you give me some advice on whether I should write a proposal or
>>>> not? (And if I should: should I propose a MapReduce or a non-MapReduce
>>>> version? There is already an NN algorithm, but without pretraining.)
>>>>
>>>> Thanks,
>>>> Maciej Mazur
>>>>
>>>> On Fri, Feb 28, 2014 at 5:44 AM, peng <[email protected]> wrote:
>>>>
>>>>> Oh, thanks a lot, I missed that one :)
>>>>> +1 on easiest one implemented first. I hadn't thought about the
>>>>> difficulty issue; I need to read more about the YARN extension.
>>>>>
>>>>> Yours Peng
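A minimal single-machine sketch of the gradient-folder scheme Maciej
describes above: workers drop per-mini-batch gradient files for one weight
shard into a directory, and a background task averages the pending files and
applies the update. The class name, file layout and update rule below are
illustrative assumptions, not Mahout or Hadoop API.

    // Single-JVM approximation of the shared-folder idea; on a cluster the
    // directory would live on a shared file system instead of local disk.
    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    public class ShardedGradientSketch {

        // Worker side: after each mini-batch, drop the gradient for this
        // shard into the shard directory.
        static void writeGradient(Path shardDir, double[] gradient) throws IOException {
            Path tmp = shardDir.resolve(UUID.randomUUID() + ".tmp");
            try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(tmp))) {
                out.writeInt(gradient.length);
                for (double g : gradient) {
                    out.writeDouble(g);
                }
            }
            // Rename only after the file is complete, so the averager never
            // reads a partial write.
            Files.move(tmp, shardDir.resolve(UUID.randomUUID() + ".grad"),
                       StandardCopyOption.ATOMIC_MOVE);
        }

        // Background side: average all pending gradient files for this shard
        // and apply one update. Assumes every gradient has the same dimension
        // as the shard's weight vector.
        static void averageAndApply(Path shardDir, double[] weights, double learningRate)
                throws IOException {
            List<double[]> gradients = new ArrayList<>();
            try (DirectoryStream<Path> files = Files.newDirectoryStream(shardDir, "*.grad")) {
                for (Path f : files) {
                    gradients.add(readVector(f));
                    Files.delete(f);            // consume the gradient file
                }
            }
            if (gradients.isEmpty()) {
                return;
            }
            for (int i = 0; i < weights.length; i++) {
                double sum = 0.0;
                for (double[] g : gradients) {
                    sum += g[i];
                }
                weights[i] -= learningRate * sum / gradients.size();  // averaged step
            }
        }

        static double[] readVector(Path f) throws IOException {
            try (DataInputStream in = new DataInputStream(Files.newInputStream(f))) {
                double[] v = new double[in.readInt()];
                for (int i = 0; i < v.length; i++) {
                    v[i] = in.readDouble();
                }
                return v;
            }
        }
    }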
>>>>> On Thu 27 Feb 2014 08:06:27 PM EST, Yexi Jiang wrote:
>>>>>
>>>>>> Hi Peng,
>>>>>>
>>>>>> Do you mean the MultilayerPerceptron? There are three 'train'
>>>>>> methods, and only one (the one without the trackingKey and groupKey
>>>>>> parameters) is implemented. In the current implementation the others
>>>>>> are not used.
>>>>>>
>>>>>> Regards,
>>>>>> Yexi
>>>>>>
>>>>>> 2014-02-27 19:31 GMT-05:00 Ted Dunning <[email protected]>:
>>>>>>
>>>>>>> Generally, for training models like this there is an assumption that
>>>>>>> fault tolerance is not particularly necessary, because the low risk
>>>>>>> of failure trades against algorithmic speed. For a reasonably small
>>>>>>> chance of failure, simply re-running the training is just fine. If
>>>>>>> there is a high risk of failure, simply checkpointing the parameter
>>>>>>> server is sufficient to allow restarts without redundancy.
>>>>>>>
>>>>>>> Sharding the parameters is quite possible and is reasonable when the
>>>>>>> parameter vector exceeds tens or hundreds of millions of parameters,
>>>>>>> but isn't likely to be necessary below that.
>>>>>>>
>>>>>>> The asymmetry is similarly not a big deal. The traffic to and from
>>>>>>> the parameter server isn't enormous.
>>>>>>>
>>>>>>> Building something simple and working first is a good thing.
>>>>>>>
>>>>>>> On Thu, Feb 27, 2014 at 3:56 PM, peng <[email protected]> wrote:
>>>>>>>
>>>>>>>> With pleasure! The original Downpour paper proposes a parameter
>>>>>>>> server from which subnodes download shards of the old model and to
>>>>>>>> which they upload gradients. So if the parameter server is down,
>>>>>>>> the process has to be delayed. It also requires that all model
>>>>>>>> parameters be stored on, atomically updated on, and fetched from a
>>>>>>>> single machine, imposing asymmetric HDD and bandwidth requirements.
>>>>>>>> This design is necessary only because each -=delta operation has to
>>>>>>>> be atomic, which cannot be ensured across the network (e.g. on
>>>>>>>> HDFS).
>>>>>>>>
>>>>>>>> But that doesn't mean the operation cannot be decentralized:
>>>>>>>> parameters can be sharded across multiple nodes, and multiple
>>>>>>>> accumulator instances can handle parts of the vector subtraction.
>>>>>>>> This should be easy if you create a buffer for the stream of
>>>>>>>> gradients and allocate the proper numbers of producers and
>>>>>>>> consumers on each machine to make sure it doesn't overflow.
>>>>>>>> Obviously this is far from the MR framework, but at least it can be
>>>>>>>> made homogeneous and slightly faster (because sparse data can be
>>>>>>>> distributed in a way that minimizes overlap, so gradients don't
>>>>>>>> have to cross the network as frequently).
>>>>>>>>
>>>>>>>> If we instead use a centralized architecture, then there must
>>>>>>>> be >=1 backup parameter server for mission-critical training.
>>>>>>>>
>>>>>>>> Yours Peng
>>>>>>>>
>>>>>>>> (e.g. we can simply use a producer/consumer pattern for all
>>>>>>>> gradients)
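A rough single-JVM sketch of the decentralized shard peng describes:
producers enqueue gradient slices for one shard into a bounded buffer, a
single consumer thread applies the -=delta updates (so they stay atomic per
shard), and the shard is checkpointed periodically so a failed node can
simply restart from its last checkpoint, as Ted suggests. The ParameterShard
class and its methods are hypothetical, not existing Mahout code.

    import java.io.*;
    import java.nio.file.*;
    import java.util.concurrent.*;

    public class ParameterShard implements Runnable {

        private final double[] weights;                   // this node's slice of the model
        private final BlockingQueue<double[]> gradients =
                new ArrayBlockingQueue<>(10_000);         // bounded buffer: producers block, never overflow
        private final double learningRate;
        private final Path checkpointFile;
        private long updatesSinceCheckpoint = 0;

        public ParameterShard(int size, double learningRate, Path checkpointFile) {
            this.weights = new double[size];
            this.learningRate = learningRate;
            this.checkpointFile = checkpointFile;
        }

        // Producer side: a worker pushes the gradient slice that belongs to this shard.
        public void push(double[] gradientSlice) throws InterruptedException {
            gradients.put(gradientSlice);
        }

        // Workers read (possibly slightly stale) weights without coordination.
        public synchronized double[] snapshot() {
            return weights.clone();
        }

        // Consumer side: apply updates one at a time; checkpoint now and then.
        @Override
        public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    double[] g = gradients.take();
                    synchronized (this) {
                        for (int i = 0; i < weights.length; i++) {
                            weights[i] -= learningRate * g[i];
                        }
                    }
                    if (++updatesSinceCheckpoint >= 1000) {
                        checkpoint();
                        updatesSinceCheckpoint = 0;
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        private void checkpoint() throws IOException {
            try (ObjectOutputStream out =
                     new ObjectOutputStream(Files.newOutputStream(checkpointFile))) {
                out.writeObject(snapshot());
            }
        }
    }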
>>>>>>>> On Thu 27 Feb 2014 05:09:52 PM EST, Yexi Jiang wrote:
>>>>>>>>
>>>>>>>>> Peng,
>>>>>>>>>
>>>>>>>>> Can you provide more details about your thought?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> 2014-02-27 16:00 GMT-05:00 peng <[email protected]>:
>>>>>>>>>
>>>>>>>>>> That should be easy, but it defeats the purpose of using Mahout,
>>>>>>>>>> as there are already enough implementations of single-node
>>>>>>>>>> backpropagation (in which case a GPU is much faster).
>>>>>>>>>>
>>>>>>>>>> Yexi:
>>>>>>>>>>
>>>>>>>>>> Regarding Downpour SGD and Sandblaster, may I suggest that the
>>>>>>>>>> implementation would be better off without a parameter server?
>>>>>>>>>> It is obviously a single point of failure and, in terms of
>>>>>>>>>> bandwidth, a bottleneck. I heard that MLlib on top of Spark has a
>>>>>>>>>> functional implementation (I have never read or tested it), and
>>>>>>>>>> it's possible to build the workflow on top of YARN. None of those
>>>>>>>>>> frameworks has a heterogeneous topology.
>>>>>>>>>>
>>>>>>>>>> Yours Peng
>>>>>>>>>>
>>>>>>>>>> On Thu 27 Feb 2014 09:43:19 AM EST, Maciej Mazur (JIRA) wrote:
>>>>>>>>>>
>>>>>>>>>>> [ https://issues.apache.org/jira/browse/MAHOUT-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913488#comment-13913488 ]
>>>>>>>>>>>
>>>>>>>>>>> Maciej Mazur edited comment on MAHOUT-1426 at 2/27/14 2:41 PM:
>>>>>>>>>>> ---------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> I've read the papers. I wasn't thinking about a distributed
>>>>>>>>>>> network; I had in mind a network that fits into memory but
>>>>>>>>>>> requires a significant amount of computation.
>>>>>>>>>>>
>>>>>>>>>>> I understand that there are better options for neural networks
>>>>>>>>>>> than map reduce. How about a non-map-reduce version? I see that
>>>>>>>>>>> you think it is something that would make sense. (Doing a
>>>>>>>>>>> non-map-reduce neural network in Mahout would be of substantial
>>>>>>>>>>> interest.) Do you think it would be a valuable contribution? Is
>>>>>>>>>>> there a need for this type of algorithm? I am thinking about
>>>>>>>>>>> multi-threaded batch gradient descent with pretraining (RBM
>>>>>>>>>>> and/or autoencoders).
>>>>>>>>>>>
>>>>>>>>>>> I have looked into the old JIRAs. The RBM patch was withdrawn:
>>>>>>>>>>> "I would rather like to withdraw that patch, because by the time
>>>>>>>>>>> i implemented it i didn't know that the learning algorithm is
>>>>>>>>>>> not suited for MR, so I think there is no point including the
>>>>>>>>>>> patch."
>>>>>>>>>>>
>>>>>>>>>>>> GSOC 2013 Neural network algorithms
>>>>>>>>>>>> -----------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>>                 Key: MAHOUT-1426
>>>>>>>>>>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1426
>>>>>>>>>>>>             Project: Mahout
>>>>>>>>>>>>          Issue Type: Improvement
>>>>>>>>>>>>          Components: Classification
>>>>>>>>>>>>            Reporter: Maciej Mazur
>>>>>>>>>>>>
>>>>>>>>>>>> I would like to ask about the possibilities of implementing
>>>>>>>>>>>> neural network algorithms in Mahout during GSoC.
>>>>>>>>>>>> There is a classifier.mlp package with a neural network.
>>>>>>>>>>>> I can see neither RBM nor Autoencoder in these classes.
>>>>>>>>>>>> There is only one mention of Autoencoders, in the NeuralNetwork
>>>>>>>>>>>> class.
>>>>>>>>>>>> As far as I know, Mahout doesn't support convolutional
>>>>>>>>>>>> networks.
>>>>>>>>>>>> Is it a good idea to implement one of these algorithms?
>>>>>>>>>>>> Is it a reasonable amount of work?
>>>>>>>>>>>> How hard is it to get into GSoC with Mahout?
>>>>>>>>>>>> Did anyone succeed last year?
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> This message was sent by Atlassian JIRA
>>>>>>>>>>> (v6.1.5#6160)
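For reference, a bare-bones sketch of the multi-threaded batch gradient
descent mentioned in the JIRA comment above: each thread computes the
gradient over its slice of the in-memory data, the partial gradients are
summed, and one synchronous update is applied per iteration. The
SliceGradient interface is a hypothetical placeholder for the actual
backpropagation code; RBM/autoencoder pretraining is not shown.

    import java.util.*;
    import java.util.concurrent.*;

    public class MultiThreadedBatchGd {

        // Computes the gradient of the loss over one slice of the training
        // data at the given weights. A real implementation would run
        // backpropagation over examples [from, to) here.
        interface SliceGradient {
            double[] compute(double[] weights, int from, int to);
        }

        public static void train(double[] weights, int numExamples, int iterations,
                                 double learningRate, int threads, SliceGradient grad)
                throws InterruptedException, ExecutionException {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            try {
                int sliceSize = (numExamples + threads - 1) / threads;
                for (int iter = 0; iter < iterations; iter++) {
                    List<Callable<double[]>> tasks = new ArrayList<>();
                    for (int t = 0; t < threads; t++) {
                        final int from = Math.min(numExamples, t * sliceSize);
                        final int to = Math.min(numExamples, from + sliceSize);
                        tasks.add(() -> grad.compute(weights, from, to));
                    }
                    // Synchronous barrier: wait for every partial gradient.
                    double[] total = new double[weights.length];
                    for (Future<double[]> f : pool.invokeAll(tasks)) {
                        double[] partial = f.get();
                        for (int i = 0; i < total.length; i++) {
                            total[i] += partial[i];
                        }
                    }
                    // One full-batch update per iteration.
                    for (int i = 0; i < weights.length; i++) {
                        weights[i] -= learningRate * total[i] / numExamples;
                    }
                }
            } finally {
                pool.shutdown();
            }
        }
    }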
