Hi all, I have a problem where I need to compare each input record against one or more large files (the large files are loaded into memory). Which file a record needs to be compared against depends on the record itself.
Currently, I have it working, but I doubt it is the optimal way. What I do is run a preprocessing job before the main job, in which each record is emitted once per file it needs to be compared against, keyed by that file's name. I then use the reduce phase to sort the records by file, so that the main job is not constantly swapping these large files in and out of memory.

This method works well; however, the first job writes a lot of data to HDFS and takes a long time to run. Given that it's a relatively simple task, I was wondering if there is a better way to do it?
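For reference, here is roughly what the first job's mapper does (a simplified sketch; TagByFileMapper and filesFor() are made-up names standing in for the real routing logic):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Tagging job: emit each input record once per reference file it must be
// compared against, keyed by the file name so the shuffle/reduce phase
// groups (and sorts) records by file.
public class TagByFileMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // filesFor() stands in for the application-specific logic that
        // decides which reference files this record is checked against.
        for (String fileName : filesFor(value.toString())) {
            context.write(new Text(fileName), value);
        }
    }

    private Iterable<String> filesFor(String record) {
        // ... application-specific routing, omitted here ...
        return java.util.Collections.emptyList();
    }
}

The reducer is essentially an identity pass-through; the point is just that the main job then sees records grouped by reference file, so it only reloads a file into memory when the key changes.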
Comments appreciated!

Kind Regards,
Shane