I have read "Parallelized stochastic gradient descent" (2010) by
Martin A. Zinkevich et al.
The parallel SGD algorithm (SimuParallelSGD) is very simple:
Define T = ⌊m/k⌋
Randomly partition the examples, giving T examples to each machine.
for all i ∈ {1, ..., k} parallel do
    Randomly shuffle the data on machine i.
    Initialize w_{i,0} = 0.
    for all t ∈ {1, ..., T} do
        Get the t-th example on the i-th machine (this machine), c_{i,t}
        w_{i,t} ← w_{i,t−1} − η ∂_w c_{i,t}(w_{i,t−1})
    end for
end for
Aggregate from all computers v = (1/k) Σ_{i=1}^{k} w_{i,T} and return v.
It assumes that each machine runs SGD locally on its own partition of the data, after randomly shuffling that partition.
So it seems each machine has to load all of its local data into memory, shuffle it, and make one SGD pass over it;
the k resulting weight vectors are then averaged.
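Just to restate my understanding of the per-machine step in code, here is a minimal plain-Java sketch (ignoring Hadoop for now). The Example class, the squared-loss gradient, and the fixed learning rate are my own placeholder choices, not from the paper:

    import java.util.Collections;
    import java.util.List;

    class LocalSgd {
        static class Example {
            double[] x;   // features
            double label; // target
            Example(double[] x, double label) { this.x = x; this.label = label; }
        }

        // One "machine i": shuffle its T local examples, start from w = 0,
        // take one gradient step per example (squared loss as an example).
        static double[] run(List<Example> data, int dim, double eta) {
            Collections.shuffle(data);      // "Randomly shuffle the data on machine i"
            double[] w = new double[dim];   // w_{i,0} = 0
            for (Example c : data) {
                double err = dot(w, c.x) - c.label;
                for (int j = 0; j < dim; j++) {
                    w[j] -= eta * err * c.x[j];   // w ← w − η ∂_w c(w)
                }
            }
            return w;
        }

        // Final aggregation: v = (1/k) Σ_{i=1}^{k} w_{i,T}
        static double[] average(List<double[]> locals, int dim) {
            double[] v = new double[dim];
            for (double[] w : locals)
                for (int j = 0; j < dim; j++) v[j] += w[j] / locals.size();
            return v;
        }

        static double dot(double[] a, double[] b) {
            double s = 0;
            for (int j = 0; j < a.length; j++) s += a[j] * b[j];
            return s;
        }
    }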
How would I do this in Hadoop?
1. How do I control the Hadoop input split size? Should I just let Hadoop decide the splits?
   Each split has to be small enough that a mapper can load it entirely into memory.
   (See the driver sketch after this list.)
2. Should I batch the examples inside each mapper?
   In the Mapper's setup(), construct a data structure to hold all examples of the split;
   in map(), just add each record to that structure;
   in cleanup() (close() in the old API), do the real SGD work.
   (See the Mapper/Reducer sketch after this list.)
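For question 1, this is roughly the driver I have in mind: cap the maximum split size so one split's worth of examples fits in a mapper's heap. The 64 MB figure and the class/job names are arbitrary assumptions on my part:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SgdDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "parallel-sgd");
            job.setJarByClass(SgdDriver.class);
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));

            // Cap the split size so a whole split fits in a mapper's memory.
            // 64 MB is an arbitrary guess; pick it from record size and mapper heap.
            FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
            // Equivalent Hadoop 2.x configuration key:
            // conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024);

            // ... set mapper/reducer/output key-value classes and output path, then:
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }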
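For question 2, this is roughly what I mean by buffering in the Mapper and doing SGD in cleanup(), with a single reducer averaging the local weight vectors. The "label,f1,...,fDIM" CSV input format, DIM, ETA and squared loss are all my own assumptions:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // One mapper = one "machine i": buffer its split, shuffle, run one SGD pass,
    // emit the local weight vector under a single key so one reducer can average.
    public class SgdMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        private static final int DIM = 100;      // assumed feature dimension
        private static final double ETA = 0.01;  // assumed learning rate
        private final List<double[]> buffer = new ArrayList<>(); // rows: [label, x1..xDIM]

        @Override
        protected void map(LongWritable key, Text value, Context ctx) {
            String[] parts = value.toString().split(",");
            double[] row = new double[parts.length];
            for (int j = 0; j < parts.length; j++) row[j] = Double.parseDouble(parts[j]);
            buffer.add(row);                      // just collect; no learning yet
        }

        @Override
        protected void cleanup(Context ctx) throws IOException, InterruptedException {
            Collections.shuffle(buffer);          // local random shuffle
            double[] w = new double[DIM];         // w_{i,0} = 0
            for (double[] row : buffer) {         // one SGD pass (squared loss here)
                double pred = 0;
                for (int j = 0; j < DIM; j++) pred += w[j] * row[j + 1];
                double err = pred - row[0];
                for (int j = 0; j < DIM; j++) w[j] -= ETA * err * row[j + 1];
            }
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < DIM; j++) sb.append(j == 0 ? "" : ",").append(w[j]);
            ctx.write(NullWritable.get(), new Text(sb.toString()));
        }
    }

    // Single reducer: v = (1/k) * sum of the k local weight vectors.
    class SgdReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
        @Override
        protected void reduce(NullWritable key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            double[] sum = null;
            int k = 0;
            for (Text t : values) {
                String[] parts = t.toString().split(",");
                if (sum == null) sum = new double[parts.length];
                for (int j = 0; j < parts.length; j++) sum[j] += Double.parseDouble(parts[j]);
                k++;
            }
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < sum.length; j++) sb.append(j == 0 ? "" : ",").append(sum[j] / k);
            ctx.write(NullWritable.get(), new Text(sb.toString()));
        }
    }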
Is this approach feasible?