But I can't control the InputSplit. What I need is:

1. Split the input data into small blocks whose size I define (e.g. the
   maximum number of training instances one machine can handle).
2. Randomly dispatch these small blocks to the machines (maybe HDFS can do
   this for me).
3. Have each mapper process exactly one small block.
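To make the three steps concrete, here is a minimal sketch in plain Python. It is not Hadoop code: `make_blocks`, `assign_blocks`, and their parameters are hypothetical names made up for illustration, and the "machines" are just dictionary keys.

```python
import random


def make_blocks(instances, max_block_size):
    """Step 1: cut the data into blocks of at most max_block_size instances."""
    if max_block_size < 1:
        raise ValueError("max_block_size must be at least 1")
    return [instances[i:i + max_block_size]
            for i in range(0, len(instances), max_block_size)]


def assign_blocks(blocks, num_machines, seed=None):
    """Step 2: hand the blocks to machines in random order, round-robin."""
    rng = random.Random(seed)
    order = list(range(len(blocks)))
    rng.shuffle(order)  # random dispatch order
    assignment = {m: [] for m in range(num_machines)}
    for position, block_index in enumerate(order):
        assignment[position % num_machines].append(blocks[block_index])
    return assignment


# Step 3: each "mapper" call would then receive exactly one block.
data = list(range(10))
blocks = make_blocks(data, max_block_size=4)
assignment = assign_blocks(blocks, num_machines=2, seed=0)
```

On the Hadoop side, NLineInputFormat (a fixed number of lines per split) looks like the closest built-in match for step 1; treat that as a pointer to check, not a tested recipe.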
On Sat, Mar 29, 2014 at 3:55 AM, Ted Dunning <[email protected]> wrote:

> Yes. That is feasible.
>
> I think that you would have better luck with something like asynchronous
> SGD as described here:
>
> http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2012_0598.pdf
>
> and here
>
> http://www.cs.toronto.edu/~fritz/absps/georgerectified.pdf
>
> It would also be good to consider looking at some of the new scala work in
> Mahout. Map-reduce is a difficult medium for this art.
>
> On Fri, Mar 28, 2014 at 5:21 AM, Li Li <[email protected]> wrote:
>
>> I have read "Parallelized stochastic gradient descent" (2010) by
>> Martin A. Zinkevich et al.
>> The parallel SGD is very simple:
>>
>> Define T = ⌊m/k⌋
>> Randomly partition the examples, giving T examples to each machine.
>> for all i ∈ {1, ..., k} parallel do
>>   Randomly shuffle the data on machine i.
>>   Initialize w_{i,0} = 0.
>>   for all t ∈ {1, ..., T} do
>>     Get the t-th example on the i-th machine (this machine), c^{i,t}
>>     w_{i,t} ← w_{i,t-1} - η ∂_w c^i(w_{i,t-1})
>>   end for
>> end for
>> Aggregate from all computers v = (1/k) Σ_{i=1}^{k} w_{i,t} and return v.
>>
>> It assumes that each machine does SGD optimization on its local data
>> and randomly shuffles the data on that machine.
>>
>> It seems each machine has to load all of its local data into memory and
>> shuffle it to perform SGD, and then the results are averaged.
>>
>> How to do this in Hadoop?
>>
>> 1. How to control the Hadoop input split size?
>>    Let Hadoop do this for me? But each split should not be so large
>>    that it can't be loaded into memory.
>> 2. Do batch?
>>    In the setUp of the Mapper, construct a data structure to store all
>>    the data of this split.
>>    In the map method, just add data to this data structure.
>>    In the close method, do the real job of SGD.
>>
>> Is my method feasible?
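For reference, the algorithm quoted above can be sketched in a single Python process. This is only an illustration under stated assumptions: the loss is squared error on a linear model (the paper's c is a generic loss), the k machines are simulated by a sequential loop, and `local_sgd` and `parallel_sgd` are hypothetical names. The buffering inside `local_sgd` corresponds to the "collect the split, then run SGD at the end" idea from the original question.

```python
import random


def local_sgd(examples, eta, dim, rng):
    """One shuffled SGD pass over a single machine's examples."""
    examples = list(examples)   # buffer the whole partition in memory
    rng.shuffle(examples)       # randomly shuffle the data on machine i
    w = [0.0] * dim             # w_{i,0} = 0
    for x, y in examples:
        error = sum(wj * xj for wj, xj in zip(w, x)) - y
        # Assumed loss: 0.5 * (w.x - y)^2, whose gradient in w is error * x.
        w = [wj - eta * error * xj for wj, xj in zip(w, x)]
    return w


def parallel_sgd(examples, k, eta, dim, seed=0):
    """Partition, run local SGD on each part, then average the weights."""
    rng = random.Random(seed)
    examples = list(examples)
    rng.shuffle(examples)                   # random partition
    T = len(examples) // k                  # T = floor(m / k); leftovers are dropped
    parts = [examples[i * T:(i + 1) * T] for i in range(k)]
    # Sequential stand-in for the "parallel do" loop.
    weights = [local_sgd(p, eta, dim, rng) for p in parts]
    return [sum(w[j] for w in weights) / k for j in range(dim)]


# Toy, noise-free data generated from y = 2 * x0 - 1 * x1.
gen = random.Random(1)
toy = []
for _ in range(2000):
    x = [gen.uniform(-1, 1), gen.uniform(-1, 1)]
    toy.append((x, 2.0 * x[0] - 1.0 * x[1]))

v = parallel_sgd(toy, k=4, eta=0.1, dim=2)
```

On this toy data the averaged vector `v` should land close to (2, -1). The only communication between "machines" is the final average, which is what makes the scheme look like a natural fit for one map phase plus one reduce.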
