On Thu, Jan 28, 2010 at 3:01 PM, Ted Dunning <[email protected]> wrote:
> Aha....
>
> If you are running SGD on a single node, just open the HDFS files directly.
> You won't have significant benefit to locality unless the files are
> relatively small.

You mean relatively large, right?

> With a single node solution, you gain little from Hadoop. The need for
> restarts and such really provides a large advantage when you have many
> nodes participating in the computation.

If he's got an N-node cluster, and each 1/N's worth of his data set takes
more than maybe 5-10 minutes to process per pass, then the overhead would
be fairly minimal. Of course, knowing how fast SGD is, there'd need to be a
lot of data to take 10 minutes to process only a fraction of a single pass
through...

Hadoop isn't doing real parallelism via this approach, but it is sending
your process to where your data is, which is a lot better than opening up a
hook into one big HDFS stream and slurping down the entire set locally, I'd
imagine, given that he says that network latency is the bottleneck when he
streams data.

  -jake
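For a rough sense of the overhead argument above, here is a back-of-envelope sketch. The 30-second fixed per-task overhead is an assumed illustrative figure, not a number from this thread:

```python
# Back-of-envelope sketch: with some fixed per-task Hadoop overhead
# (task startup, scheduling), the overhead fraction shrinks as each
# node's 1/N slice of the data takes longer to process.
def overhead_fraction(work_seconds, task_overhead_seconds=30.0):
    """Fraction of a task's wall time spent on fixed overhead,
    assuming a constant task_overhead_seconds cost per task."""
    return task_overhead_seconds / (task_overhead_seconds + work_seconds)

# If each node's slice takes 10 minutes (600 s) of real work, an
# assumed 30 s of overhead is under 5% of the wall time -- the
# "fairly minimal" case.  A slice that finishes in 60 s, by
# contrast, pays the same 30 s, i.e. a third of its wall time.
print(round(overhead_fraction(600), 3))
print(round(overhead_fraction(60), 3))
```

The exact overhead number doesn't matter much; the point is that it is fixed per task, so longer-running slices amortize it away.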
