Hi,

> No. I was referring to the fact that the locality of any given large file
> is about the same from any node because there are many blocks and they get
> spattered all over.
Yes, that is true with respect to my comment about running the code on one carefully chosen box: I agree there is no box that is "closer" to the data on average. What Jake & I were discussing is not finding that one box, but using multiple boxes for the learning, sequentially: the code is sent to the machine that holds the block of the data we are currently interested in, and there is always one machine that fulfills this criterion. Once that machine has finished learning on its block, the algorithm & current model are shipped to the machine that holds the next block of the file. That way, the model gets shipped through the network, but not the data.

A somewhat hacky way to implement this is to split the input file into files of exactly one block in size and then submit one map job per file.

Markus
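To make the idea concrete, here is a minimal toy sketch (plain Java, not the real Hadoop API; all names are illustrative) of the sequential scheme: each "node" holds one block, and the model visits the blocks one after another, being updated in place on each. The model here is just a running mean, standing in for whatever incremental learner you would actually ship.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of "ship the model, not the data": each node holds one block of
// the file; the model travels from node to node and is updated locally.
public class SequentialLearning {

    // Toy model: running mean of all values seen so far.
    static final class Model {
        double sum = 0.0;
        long count = 0;

        // One local pass over the block held by the current node.
        void learn(double[] block) {
            for (double x : block) { sum += x; count++; }
        }

        double mean() { return count == 0 ? 0.0 : sum / count; }
    }

    // One "node" per block; in Hadoop terms this would be one map task
    // whose input split is a single HDFS block.
    static Model trainSequentially(List<double[]> blocks) {
        Model model = new Model();
        for (double[] block : blocks) {
            model.learn(block);  // the model moves; the data stays put
        }
        return model;
    }

    public static void main(String[] args) {
        List<double[]> blocks = Arrays.asList(
            new double[]{1, 2, 3},
            new double[]{4, 5, 6});
        Model m = trainSequentially(blocks);
        System.out.println("mean = " + m.mean());  // prints "mean = 3.5"
    }
}
```

In the hacky one-block-files variant, each loop iteration above corresponds to one single-mapper job, with the current model serialized out and read back in by the next job.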
