On Thu, Jan 28, 2010 at 3:01 PM, Ted Dunning <[email protected]> wrote:

> Aha....
>
> If you are running SGD on a single node, just open the HDFS files directly.
> You won't have significant benefit to locality unless the files are
> relatively small.
>

You mean relatively large, right?


> With a single node solution, you gain little from Hadoop.  The need for
> restarts and such really provide large advantage when you have many nodes
> participating in the computation.
>

If he's got an N-node cluster, and each node's 1/N share of the data set
takes more than maybe 5-10 minutes to process per pass, then the overhead
would be fairly minimal.  Of course, knowing how fast SGD is, there'd
need to be a lot of data for just 1/N of a single pass to take 10
minutes...
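To make that "fairly minimal" concrete, here's a rough back-of-the-envelope sketch. The ~30-second per-task startup/scheduling cost is an assumed ballpark figure, not something from this thread:

```python
# Rough estimate of Hadoop's per-pass overhead for an N-way split.
# startup_seconds (~30 s for task scheduling/JVM startup) is an
# assumed ballpark, not a measured figure.

def overhead_fraction(shard_seconds, startup_seconds=30.0):
    """Fraction of each pass spent on Hadoop overhead rather than SGD."""
    return startup_seconds / (shard_seconds + startup_seconds)

# If each node's 1/N shard takes 10 minutes per pass, overhead is small:
print(f"overhead: {overhead_fraction(10 * 60):.1%}")  # roughly 5%

# If SGD chews through the shard in 30 seconds, overhead dominates:
print(f"overhead: {overhead_fraction(30):.1%}")  # 50.0%
```

That's the crux: SGD is usually fast enough that the shard time lands near the second case, not the first.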

Hadoop isn't doing real parallelism via this approach, but it is sending
your process to where your data is, which I'd imagine is a lot better than
opening up a hook into one big HDFS stream and slurping the entire set
down locally, given that he says network latency is the bottleneck when
he streams the data.

  -jake
