On Thu, Jun 18, 2009 at 01:36:14PM -0700, Owen O'Malley wrote:
> On Jun 18, 2009, at 10:56 AM, pmg wrote:
>
> >Each line from FileA gets compared with every line from FileB1,
> >FileB2 etc. etc. FileB1, FileB2 etc. are in a different input directory
>
> In the general case, I'd define an InputFormat that takes two
> directories, computes the input splits for each directory and
> generates a new list of InputSplits that is the cross-product of the
> two lists. So instead of FileSplit, it would use a FileSplitPair that
> gives the FileSplit for dir1 and the FileSplit for dir2 and the record
> reader would return a TextPair with left and right records (ie.
> lines). Clearly, you read the first line of split1 and cross it by
> each line from split2, then move to the second line of split1 and
> process each line from split2, etc.
>
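For concreteness, here is a rough sketch of what I understand Owen to be describing, written against the old org.apache.hadoop.mapred API. CrossProductInputFormat, FileSplitPair and the cross.input.dir1/dir2 configuration keys are made-up names (nothing that ships with Hadoop), and the record reader, which would re-scan split2 for every line of split1, is left out:

// Rough sketch only. CrossProductInputFormat, FileSplitPair and the
// cross.input.dir1/dir2 keys are invented names; uses the old
// org.apache.hadoop.mapred API.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

/** An InputSplit that pairs one split from dir1 with one split from dir2. */
class FileSplitPair implements InputSplit {
  private FileSplit left;
  private FileSplit right;

  public FileSplitPair() {                       // needed for deserialization
    left  = new FileSplit(new Path("unset"), 0, 0, new String[0]);
    right = new FileSplit(new Path("unset"), 0, 0, new String[0]);
  }

  public FileSplitPair(FileSplit left, FileSplit right) {
    this.left = left;
    this.right = right;
  }

  public FileSplit getLeft()  { return left; }
  public FileSplit getRight() { return right; }

  public long getLength() throws IOException {
    return left.getLength() + right.getLength();
  }

  public String[] getLocations() throws IOException {
    // Advertise the hosts of both halves; the scheduler may pick any of them.
    List<String> hosts = new ArrayList<String>();
    for (String h : left.getLocations())  hosts.add(h);
    for (String h : right.getLocations()) hosts.add(h);
    return hosts.toArray(new String[hosts.size()]);
  }

  public void write(DataOutput out) throws IOException {
    left.write(out);
    right.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    left.readFields(in);
    right.readFields(in);
  }
}

/** Builds the cross product of the splits of two input directories. */
public class CrossProductInputFormat implements InputFormat<Text, Text> {

  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    InputSplit[] left  = splitsFor(job, job.get("cross.input.dir1"), numSplits);
    InputSplit[] right = splitsFor(job, job.get("cross.input.dir2"), numSplits);

    // Every split of dir1 gets paired with every split of dir2.
    List<InputSplit> pairs = new ArrayList<InputSplit>();
    for (InputSplit l : left) {
      for (InputSplit r : right) {
        pairs.add(new FileSplitPair((FileSplit) l, (FileSplit) r));
      }
    }
    return pairs.toArray(new InputSplit[pairs.size()]);
  }

  // Reuse TextInputFormat to compute the per-directory splits.
  private InputSplit[] splitsFor(JobConf job, String dir, int numSplits)
      throws IOException {
    JobConf copy = new JobConf(job);
    FileInputFormat.setInputPaths(copy, new Path(dir));
    TextInputFormat text = new TextInputFormat();
    text.configure(copy);
    return text.getSplits(copy, numSplits);
  }

  public RecordReader<Text, Text> getRecordReader(InputSplit split, JobConf job,
      Reporter reporter) throws IOException {
    // Omitted here: open both halves of the FileSplitPair and, for each line
    // of the left split, re-scan every line of the right split.
    throw new UnsupportedOperationException("sketch only");
  }
}

The getLocations() above just returns the union of both splits' hosts, which is roughly where my question comes in.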
Out of curiosity, how does Hadoop schedule tasks when a task needs multiple inputs and the data for a task is on different nodes? How does it decide which node will be more "local" and should have the task steered to it?

-Erik