Yes, I usually follow #2 too. The thing is, pretty often an algorithm can
define its own set of strategies that the backend needs to support (like this
distributedEngine strategy) while still keeping a lot of logic common across
all strategies. But then, if an all-reduce aggregate operation is incredibly
common among such algorithm-specific strategies, it stands to reason to
implement it only once.

I have an idea. Maybe we need a common distributed strategy which is separate
from the algebraic optimizer. That way we don't have to mess with algebraic
rewrites. How about that?
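Roughly, something like the following (a minimal sketch only; every name here
is made up for illustration, not an existing Mahout API):

import org.apache.mahout.math.Vector
import org.apache.mahout.math.drm.DrmLike

// Hypothetical engine-neutral aggregation contract. The DSL declares the
// aggregation once, each backend implements it once over its own native
// primitives, and any algorithm-specific strategy that needs an all-reduce
// shares that single implementation instead of touching engine internals.
trait DistributedAggregations {

  // Fold the rows of each partition into a per-partition accumulator
  // (seqOp), then combine the accumulators into one vector on the driver
  // (combOp). zero is the initial per-partition value.
  def allReduce[K](drm: DrmLike[K],
                   zero: Vector,
                   seqOp: (Vector, Vector) => Vector,
                   combOp: (Vector, Vector) => Vector): Vector
}

A Spark backend would presumably implement this as a thin wrapper over
rdd.aggregate; another engine would map it onto its own fold/combine
primitives, and the algebraic optimizer never has to see any of it.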
On Wed, Nov 12, 2014 at 9:12 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

> So you are following #2, which is good. #1 seems a bit like a hack. For a
> long time to come we will have to add things to the DSL if it is to be
> kept engine-independent. Yours looks pretty general and simple.
>
> Are you familiar with the existing Mahout aggregate methods? They show up
> in SGDHelper.java and other places in legacy code. I don't know much about
> them, but they seem to be a pre-functional-programming attempt at this
> kind of thing. It looks like you are proposing a replacement for those
> based on rdd.aggregate; if so, that would be very useful. For one thing,
> it looks like the old aggregate was not parallel, while rdd.aggregate is.
>
>
> On Nov 11, 2014, at 1:18 PM, Gokhan Capan <gkhn...@gmail.com> wrote:
>
> So the alternatives are:
>
> 1- mapBlock to a matrix in which all rows but the first are empty, then
> aggregate
> 2- depend on a backend
>
> 1 is obviously OK.
>
> I don't like the idea of depending on a backend, since SGD is a generic
> loss minimization on which other algorithms will possibly depend.
>
> In this context, client-side aggregation is not an overhead, but even if
> it happens to be so, it doesn't have to be a client-side aggregate at all.
>
> As an alternative to 1, I am thinking of at least having an aggregation
> operation, which would return an accumulated value anyway and shouldn't
> affect algebra optimizations.
>
> I quickly implemented a naive one (supporting only Spark -- I know I said
> that I don't like depending on a backend, but at least the backends-wide
> interface is consistent, and as a client I still don't have to deal with
> Spark primitives directly).
>
> Is this nice enough? Is it too bad to have in the DSL?
> https://github.com/gcapan/mahout/compare/accumulateblocks
>
> Best
>
> Gokhan
>
> On Tue, Nov 11, 2014 at 10:45 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
> > Oh. The algorithm actually collects the vectors and runs another cycle
> > in the client!
> >
> > Still, technically, you can collect almost-empty blocks to the client
> > (since they are mostly empty, it won't cause THAT huge an overhead
> > compared to collecting single vectors; after all, how many partitions
> > are we talking about? 1000?).
> >
> > On Tue, Nov 11, 2014 at 12:41 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> > wrote:
> >
> >> On Sat, Nov 8, 2014 at 12:42 PM, Gokhan Capan <gkhn...@gmail.com>
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> Based on Zinkevich et al.'s Parallelized Stochastic Gradient paper
> >>> (http://martin.zinkevich.org/publications/nips2010.pdf), I tried to
> >>> implement SGD, and a regularized least squares solution for linear
> >>> regression (it can easily be extended to other GLMs, too).
> >>>
> >>> The algorithm works as follows:
> >>> 1. Split the data into partitions of T examples
> >>> 2. In parallel, for each partition:
> >>>    2.0. shuffle the partition
> >>>    2.1. initialize the parameter vector
> >>>    2.2. for each example in the shuffled partition:
> >>>         2.2.1. update the parameter vector
> >>> 3. Aggregate all the parameter vectors and return
> >>
> >> I guess technically it is possible (transform each block to a
> >> SparseRowMatrix or SparseMatrix with only the first row valid) and then
> >> invoke colSums() or colMeans() (whatever "aggregate" means).
> >>
> >> However, I am not sure it is worth the ugliness. Isn't it easier to
> >> declare these things quasi-algebraic and just do direct Spark calls on
> >> the matrix RDD (map, aggregate)?
> >>
> >> The real danger is introducing non-algebra things into the algebra, so
> >> that the rest of the algebra no longer optimizes.
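For concreteness, option #1 from the thread could look roughly like this in
the DSL (an untested sketch: localSgdPass is a made-up stand-in for steps
2.0-2.2 of the per-partition loop above, the extra count column is just one
convenient way to recover the partition count for the final average, and
non-empty partitions are assumed):

import scala.reflect.ClassTag

import org.apache.mahout.math.{Matrix, SparseRowMatrix, Vector}
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._

// Option #1, sketched out: run the sequential SGD pass inside mapBlock,
// park each partition's learned parameter vector in row 0 of an otherwise
// empty sparse block, and let colSums() do the aggregation in parallel on
// the backend.
def parallelSgd[K: ClassTag](drmXy: DrmLike[K], nFeatures: Int)
                            (localSgdPass: Matrix => Vector): Vector = {

  // One extra column carries a 1 per partition, so the same colSums()
  // that sums the parameter vectors also counts the partitions.
  val perPartition = drmXy.mapBlock(ncol = nFeatures + 1) {
    case (keys, block) =>
      val theta = localSgdPass(block)
      val out = new SparseRowMatrix(block.nrow, nFeatures + 1)
      for (j <- 0 until nFeatures) out(0, j) = theta(j)
      out(0, nFeatures) = 1.0
      keys -> out
  }

  // Step 3: aggregate. Only the (nFeatures + 1)-long sums vector reaches
  // the client; dividing by the partition count gives the average of
  // Zinkevich et al.
  val sums = perPartition.colSums()
  val nPartitions = sums(nFeatures)
  sums.viewPart(0, nFeatures) / nPartitions
}

Nothing but that one short vector ever materializes on the client, and since
mapBlock and colSums are already DSL operations, the algebraic optimizer is
left undisturbed.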