Hi,

> No. I was referring to the fact that the locality of any given large file
> is about the same from any node because there are many blocks and they get
> spattered all over.
Yes, that is true with respect to my comment about running the code on one carefully chosen box: I agree there is no box that is "closer" to the data on average. What Jake & I were discussing is not finding that one box, but using multiple boxes for the learning, sequentially: the code is sent to the machine that holds the block of the data we are currently interested in, and there is always one machine that fulfills this criterion. Once that machine has finished learning on its block, the algorithm & current model are shipped to the machine that holds the next block of the file. That way, the model gets shipped through the network, but not the data.

A somewhat hacky way to implement this is to split the input file into files of exactly one block in size and then submit one map job per file.

Markus
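To make the idea concrete, here is a minimal toy sketch (plain Java, not the real Hadoop API; all names are illustrative) of the sequential scheme: each "node" holds one block, and the model visits the blocks one after another, being updated in place on each. The model here is just a running mean, standing in for whatever incremental learner you would actually ship.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of "ship the model, not the data": each node holds one block of
// the file; the model travels from node to node and is updated locally.
public class SequentialLearning {

    // Toy model: running mean of all values seen so far.
    static final class Model {
        double sum = 0.0;
        long count = 0;

        // One local pass over the block held by the current node.
        void learn(double[] block) {
            for (double x : block) { sum += x; count++; }
        }

        double mean() { return count == 0 ? 0.0 : sum / count; }
    }

    // One "node" per block; in Hadoop terms this would be one map task
    // whose input split is a single HDFS block.
    static Model trainSequentially(List<double[]> blocks) {
        Model model = new Model();
        for (double[] block : blocks) {
            model.learn(block);  // the model moves; the data stays put
        }
        return model;
    }

    public static void main(String[] args) {
        List<double[]> blocks = Arrays.asList(
            new double[]{1, 2, 3},
            new double[]{4, 5, 6});
        Model m = trainSequentially(blocks);
        System.out.println("mean = " + m.mean());  // prints "mean = 3.5"
    }
}
```

In the hacky one-block-files variant, each loop iteration above corresponds to one single-mapper job, with the current model serialized out and read back in by the next job.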
