On Thu, Jan 28, 2010 at 2:06 PM, Markus Weimer <[email protected]> wrote:
> Hi Jake
>
> > Well let me see what we would imagine is going on: your
> > data lives all over HDFS, because it's nice and big. The
> > algorithm wants to run over the set in a big streamy fashion.
>
> Yes, that sums it up nicely. The important part is to be able to
> stream over the data several times.

Yeah, I recently committed a stream-oriented SVD impl which ideally
would do the same thing - it's non-parallelizable, but fast on a
streaming set (although I'm not sure how long it would take to converge
on a set which was so big that it didn't fit on a single box... that's
mighty big for a non-parallelized algorithm...)

> > You clearly don't want to move your multi-TB dataset
> > around, but moving the 0.5GB model state around is
> > ok, yes?
>
> Yes! That is an awesome idea to minimize the I/O on the system.
> However, it does not yet address the issue of multiple passes over the
> data. But that can easily be done by handing around the 0.5GB another
> time.
>
> I could store the 0.5GB on HDFS, where it is read by the mapper in
> setup(), updated in map() over the data and stored again in cleanup(),
> together with some housekeeping data about which InputSplit was
> processed. The driver program could then orchestrate the different
> mappers and manage the global state of this procedure. Now I only need
> to figure out how to do this ;-)

Yeah, the nice thing about this hack is that you could also reorder the
shards on any given pass, if the order of input points isn't important
(why would you want to do this? maybe the node which holds some of your
data is heavily loaded by someone else's single-point-of-failure
reducer, so you instead save that shard for later. Not sure how you do
that on Hadoop, but it would be cool!).

Since we've got some SGD coming down the pipe in Mahout, in particular,
we'd love to hear how you end up going with this!

  -jake
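For reference, here is a rough sketch of the setup()/map()/cleanup()
pattern Markus describes, against the Hadoop 0.20 Mapper API. Everything
model-related is made up for illustration (the Model class, the
model.input.path / model.output.path config keys, the serialization), and
it assumes the driver runs one shard's map task at a time so only a
single copy of the model is ever live:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class StreamingPassMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  /** Tiny stand-in for the 0.5GB model state. */
  public static class Model {
    private double weight;  // pretend this is the real model state
    private final List<String> processedSplits = new ArrayList<String>();

    static Model read(DataInput in) throws IOException {
      Model m = new Model();
      m.weight = in.readDouble();
      int n = in.readInt();
      for (int i = 0; i < n; i++) {
        m.processedSplits.add(in.readUTF());
      }
      return m;
    }

    void write(DataOutput out) throws IOException {
      out.writeDouble(weight);
      out.writeInt(processedSplits.size());
      for (String s : processedSplits) {
        out.writeUTF(s);
      }
    }

    void update(String record) {
      weight += record.length();  // placeholder for the real update step
    }

    void markProcessed(String split) {
      processedSplits.add(split);
    }
  }

  private Model model;
  private FileSystem fs;
  private Path modelOut;

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    fs = FileSystem.get(conf);
    // The driver points each task at the current model copy on HDFS.
    Path modelIn = new Path(conf.get("model.input.path"));
    modelOut = new Path(conf.get("model.output.path"));
    FSDataInputStream in = fs.open(modelIn);
    try {
      model = Model.read(in);
    } finally {
      in.close();
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    // One streaming update per record; nothing is emitted downstream.
    model.update(value.toString());
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    // Note which shard this task covered, for the driver's bookkeeping.
    FileSplit split = (FileSplit) context.getInputSplit();
    model.markProcessed(split.getPath().toString());
    FSDataOutputStream out = fs.create(modelOut);
    try {
      model.write(out);
    } finally {
      out.close();
    }
  }
}

The driver's pass loop would then feed each task's model output path in
as the next task's input path, and the shard-reordering trick above is
just a matter of which InputSplit the driver schedules next.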
