Owen O'Malley wrote:
Of course, once we allow user-defined InputSplits we will be back in exactly the same boat of running user-code on the JobTracker, unless we also ship over the preferred hosts for each InputFormat too.

So, to keep user code out of the job tracker entirely, we'd need a final class that represents each task to be created, a SplitLocations. These would correspond 1-1 to splits, but would contain only the list of preferred hosts. One way to implement this might be to write two parallel files in DFS, one with the SplitLocations and one with the Splits. The first is passed to the job tracker along with the name of the second. Only task child processes would then open the split file, seeking to the appropriate index. We could use ArrayFile for these and replicate them highly, especially their indexes.
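
To make the two-file idea concrete, here is a rough sketch in plain Java. It is not a proposed implementation. Everything in it is an assumption for illustration: the class and method names are hypothetical, the local filesystem stands in for DFS, splits are treated as opaque bytes that the client has already serialized, and a simple offset table at the head of the split file stands in for ArrayFile's index.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SplitFilesSketch {

    /** Final, framework-owned record: only the preferred hosts for one task. */
    public static final class SplitLocations {
        private final List<String> hosts;
        public SplitLocations(List<String> hosts) {
            this.hosts = Collections.unmodifiableList(new ArrayList<>(hosts));
        }
        public List<String> hosts() { return hosts; }
    }

    /** Client side: write the two parallel files; record i in each belongs to task i. */
    public static void write(Path locFile, Path splitFile,
                             List<SplitLocations> locs, List<byte[]> splits) throws IOException {
        if (locs.size() != splits.size()) {
            throw new IllegalArgumentException("locations and splits must correspond 1-1");
        }
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(locFile)))) {
            out.writeUTF(splitFile.toString());   // name of the second file
            out.writeInt(locs.size());
            for (SplitLocations l : locs) {
                out.writeInt(l.hosts().size());
                for (String h : l.hosts()) out.writeUTF(h);
            }
        }
        try (RandomAccessFile raf = new RandomAccessFile(splitFile.toFile(), "rw")) {
            raf.setLength(0);
            raf.writeInt(splits.size());
            long offset = 4 + 8L * splits.size(); // first record starts after the offset table
            for (byte[] s : splits) { raf.writeLong(offset); offset += 4 + s.length; }
            for (byte[] s : splits) { raf.writeInt(s.length); raf.write(s); }
        }
    }

    /** Job tracker side: reads host lists only; never touches user-defined split bytes. */
    public static List<SplitLocations> readLocations(Path locFile) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(locFile)))) {
            in.readUTF();                          // split file name, handed on to tasks
            int count = in.readInt();
            List<SplitLocations> result = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                int n = in.readInt();
                List<String> hosts = new ArrayList<>(n);
                for (int j = 0; j < n; j++) hosts.add(in.readUTF());
                result.add(new SplitLocations(hosts));
            }
            return result;
        }
    }

    /** Task child side: seek to the record for this task's index and read its bytes. */
    public static byte[] readSplit(Path splitFile, int index) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(splitFile.toFile(), "r")) {
            int count = raf.readInt();
            if (index < 0 || index >= count) {
                throw new IndexOutOfBoundsException("no split at index " + index);
            }
            raf.seek(4 + 8L * index);              // entry in the offset table
            raf.seek(raf.readLong());              // start of the record itself
            byte[] buf = new byte[raf.readInt()];
            raf.readFully(buf);
            return buf;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("splits");
        Path locFile = dir.resolve("job.locations");
        Path splitFile = dir.resolve("job.splits");
        write(locFile, splitFile,
              Arrays.asList(new SplitLocations(Arrays.asList("host1", "host2")),
                            new SplitLocations(Arrays.asList("host3"))),
              Arrays.asList("split-0".getBytes(StandardCharsets.UTF_8),
                            "split-1".getBytes(StandardCharsets.UTF_8)));
        System.out.println(readLocations(locFile).get(1).hosts());
        System.out.println(new String(readSplit(splitFile, 1), StandardCharsets.UTF_8));
    }
}
```

The point of the layout is that readLocations only ever deserializes strings, so nothing user-defined has to be loaded in the job tracker, while readSplit hands the raw bytes to whatever user code runs in the task.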

Doug
