Re: multiple file input

Owen O'Malley Thu, 18 Jun 2009 13:37:05 -0700

On Jun 18, 2009, at 10:56 AM, pmg wrote:

Each line from FileA gets compared with every line from FileB1, FileB2, etc.
FileB1, FileB2, etc. are in a different input directory

In the general case, I'd define an InputFormat that takes two directories, computes the input splits for each directory and generates a new list of InputSplits that is the cross-product of the two lists. So instead of FileSplit, it would use a FileSplitPair that gives the FileSplit for dir1 and the FileSplit for dir2, and the record reader would return a TextPair with left and right records (i.e. lines). The record reader reads the first line of split1 and crosses it with each line from split2, then moves to the second line of split1 and processes each line from split2, etc.
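
To make the shape of that concrete, here is a rough sketch in plain Python rather than against the Hadoop API. The names cross_splits and read_pairs, and the (path, offset, length) tuple standing in for a FileSplit, are made up for illustration; a real version would be an InputFormat and RecordReader written in Java.

```python
from itertools import product
from typing import Iterator, List, Sequence, Tuple

# Hypothetical stand-in for a FileSplit: (path, start offset, length).
Split = Tuple[str, int, int]


def cross_splits(dir1_splits: Sequence[Split],
                 dir2_splits: Sequence[Split]) -> List[Tuple[Split, Split]]:
    """Pair every split of dir1 with every split of dir2 (one pair per map)."""
    return list(product(dir1_splits, dir2_splits))


def read_pairs(left_lines: Sequence[str],
               right_lines: Sequence[str]) -> Iterator[Tuple[str, str]]:
    """Yield (left, right) for each left line crossed with each right line.

    The right side is scanned once per left line, which mirrors
    re-reading split2 for every record of split1.
    """
    for left in left_lines:
        for right in right_lines:
            yield (left, right)


# One split in dir1 and two in dir2 give two split pairs, so two maps.
splits = cross_splits([("input1/FileA", 0, 64)],
                      [("input2/FileB1", 0, 64), ("input2/FileB2", 0, 64)])

# Two left lines crossed with three right lines give six record pairs.
pairs = list(read_pairs(["a1", "a2"], ["b1", "b2", "b3"]))
```

Each element of splits would become one map task, and read_pairs is what that task's record reader would hand to the mapper, one pair at a time.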

You'll need to ensure that you don't overwhelm the system with too many input splits (i.e. maps). Also don't forget that N^2/M grows much faster with the size of the input (N) than the M machines can handle in a fixed amount of time.
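
A quick back-of-the-envelope version of that, using the record counts from this thread (600K and 2 million). The machine count of 100 is purely an assumption for illustration, and the function names are made up.

```python
def map_task_count(splits_dir1: int, splits_dir2: int) -> int:
    """Number of maps when every dir1 split is paired with every dir2 split."""
    return splits_dir1 * splits_dir2


def comparisons_per_machine(records_a: int, records_b: int,
                            machines: int) -> float:
    """Record pairs each machine handles if the work is spread evenly."""
    return records_a * records_b / machines


# Record counts from the thread; 100 machines is an assumed cluster size.
total = 600_000 * 2_000_000
per_machine = comparisons_per_machine(600_000, 2_000_000, 100)

# Doubling both inputs on the same cluster quadruples the per-machine work.
doubled = comparisons_per_machine(1_200_000, 4_000_000, 100)
growth = doubled / per_machine
```

That is about 1.2 * 10^12 comparisons in total. Doubling both inputs quadruples the work per machine, so you would need four times the machines just to stay at the same running time.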

Two input directories

1. input1 directory with a single file of 600K records - FileA
2. input2 directory segmented into different files with 2 million records -
FileB1, FileB2 etc.

In this particular case, it would be right to load all of FileA into memory and process the chunks of FileB/part-*. That would be much faster than re-reading the file over and over again, but otherwise it would be the same.
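
Roughly, each map would do something like the following. This is a sketch in plain Python, not Hadoop code: compare_chunk and the matches callback are hypothetical names, and how FileA reaches each task (for instance via the distributed cache) is an assumption that isn't settled in this thread.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def compare_chunk(file_a_lines: List[str],
                  file_b_chunk: Iterable[str],
                  matches: Callable[[str, str], bool]
                  ) -> Iterator[Tuple[str, str]]:
    """Stream one chunk of FileB past an in-memory copy of FileA.

    file_a_lines is loaded once per task; file_b_chunk is read only once,
    so nothing is re-read from disk.
    """
    for b_line in file_b_chunk:
        for a_line in file_a_lines:
            if matches(a_line, b_line):
                yield (a_line, b_line)


# Toy data; equality stands in for whatever comparison the job really needs.
file_a = ["apple", "banana"]
chunk = iter(["banana", "cherry", "apple"])
result = list(compare_chunk(file_a, chunk, lambda a, b: a == b))
```

The set of pairs compared is the same as in the general case; the only change is that the smaller side stays in memory while the larger side is streamed through once.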


-- Owen
