On Thu, Jan 28, 2010 at 1:37 PM, Markus Weimer <[email protected]> wrote:
>
> > How does network bandwidth come into play in a "local" solution?
>
> Data may not fit on one disk and must be streamed through the network
> to the learning algorithm. If the data does indeed fit onto one disk,
> the algorithm becomes disk bandwidth bound.
>
Ok, I understand this part now.

> There is no parallelism to be exploited: I'm doing SGD-style learning.
> As the parallelization thereof is a largely unsolved problem, the
> learning is strictly sequential. The desire to run it on a hadoop
> cluster stems from the fact that data preprocessing and the
> application of the learned model is a perfect fit for it. It would be
> neat if the actual learning could be done on the cluster as well, if
> only on a single, carefully chosen node close to the data.
>

Well, let me see if I understand what's going on: your data lives all over
HDFS, because it's nice and big. The algorithm wants to run over the whole
set in one big streaming pass. At any given point, once it's done processing
the local data, it can output 0.5GB of state and pick that up somewhere else
to continue, is that correct? You clearly don't want to move your multi-TB
dataset around, but moving the 0.5GB model state around is ok, yes?

It seems like what you'd want to do is pass that state info around your
cluster, sequentially using one node at a time to process chunks of your
data set; I'm just not sure what sort of non-hacky way there is to do this
in Hadoop.

Simple hack: split your data set manually into a bunch of smaller (small
enough for one disk) non-splittable files, then have the same job run over
and over again with different input sources. Each time it finishes, it
writes its state to HDFS, and each time it starts, the mapper slurps that
state back down from HDFS. This latter mini-shuffle is a little inefficient
(probably two remote copies happen), but it's a fairly small amount of data
being transferred, and hopefully IO would no longer be the bottleneck.

  -jake

> Thanks,
>
> Markus
>
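For what it's worth, here is a minimal sketch (against the 0.20-style
org.apache.hadoop.mapreduce API) of the chained-job hack described above:
each pass is a map-only job over one non-splittable chunk (e.g. a gzipped
file, so a single mapper streams it sequentially); the mapper loads the
previous model state from HDFS in setup(), trains in map(), and emits the
updated state in cleanup(), which the driver feeds into the next pass. The
SgdModel class, the "sgd.model.path" key, and the /models/sgd-state paths
are hypothetical stand-ins, not anything from Hadoop or Mahout.

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class ChainedSgd {

    // Hypothetical stand-in for the ~0.5GB of SGD state carried between chunks.
    public static class SgdModel {
      static SgdModel read(FileSystem fs, Path p) throws IOException {
        /* deserialize weights from HDFS */ return new SgdModel();
      }
      void train(String example) { /* one sequential SGD update */ }
      String serialize() { /* flatten weights to text */ return ""; }
    }

    public static class SgdMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

      private SgdModel model;

      @Override
      protected void setup(Context ctx) throws IOException {
        // Slurp the previous pass's model state down from HDFS.
        Configuration conf = ctx.getConfiguration();
        Path modelPath = new Path(conf.get("sgd.model.path"));
        model = SgdModel.read(modelPath.getFileSystem(conf), modelPath);
      }

      @Override
      protected void map(LongWritable key, Text value, Context ctx) {
        model.train(value.toString());
      }

      @Override
      protected void cleanup(Context ctx)
          throws IOException, InterruptedException {
        // Emit the updated state so the next job in the chain can pick it up.
        ctx.write(NullWritable.get(), new Text(model.serialize()));
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // args: one non-splittable chunk per pass, e.g. /data/chunk-000.gz ...
      // /models/sgd-state-0 is assumed to hold an initial (e.g. zero) model.
      String modelPath = "/models/sgd-state-0";
      for (int i = 0; i < args.length; i++) {
        conf.set("sgd.model.path", modelPath);
        Job job = new Job(conf, "sgd-pass-" + i);
        job.setJarByClass(ChainedSgd.class);
        job.setMapperClass(SgdMapper.class);
        job.setNumReduceTasks(0);              // map-only: learning stays sequential
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[i]));
        Path out = new Path("/models/sgd-state-" + (i + 1));
        FileOutputFormat.setOutputPath(job, out);
        if (!job.waitForCompletion(true)) {
          System.exit(1);
        }
        // The single map task wrote the updated model here; hand it to the next pass.
        modelPath = out + "/part-m-00000";
      }
    }
  }

The two remote copies mentioned above show up here as the setup() read of the
previous state plus the cleanup() write of the new one; the multi-TB data
itself never moves, since each map task reads its chunk where it sits.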
