I am trying to run an application that generates the cartesian product of two potentially large data sets. In reality I only need the cartesian product of the values that share a particular integer key.

I am considering a two-job design. In the first job, the mappers run through the values of set A, emitting that integer as the key and the item as the value; the reducers are simple identity reducers. In the second job, the mappers run through set B, emitting the same integer keys with the items as values, and the reducers read the output of the first job to run through the values of A. A rough sketch of the second job's reducer is below.

Here is the issue: assuming the same hashing partitioner is used and both jobs run the same number of reducers, a specific reducer, say reducer 12, will receive the same keys in both jobs, and thus part-r-00012 from the first job is the only file reducer 12 will need to read. Can I guarantee (without restricting the number of reducers to a smaller number than the cluster will support) that this condition is met, namely that the keys in the second job hit the same reducer number as in the first job? What about restarts and failures?

BTW, is there any way to find out the size of a cluster?
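Here is roughly the second-job reducer I have in mind, assuming the first job writes its identity-reduced output as SequenceFiles of (IntWritable, Text). The property name "cartesian.first.output" and the class name are just placeholders I made up:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CartesianReducer extends Reducer<IntWritable, Text, Text, Text> {

    private Configuration conf;
    private Path firstJobPart;

    @Override
    protected void setup(Context context) {
        conf = context.getConfiguration();
        // This reduce task's partition number. If both jobs use the same
        // partitioner and the same reducer count, this is also the suffix
        // of the first job's part file holding the A values for every key
        // this reducer will see.
        int partition = context.getTaskAttemptID().getTaskID().getId();
        String dir = conf.get("cartesian.first.output");
        firstJobPart = new Path(dir, String.format("part-r-%05d", partition));
    }

    @Override
    protected void reduce(IntWritable key, Iterable<Text> bValues, Context context)
            throws IOException, InterruptedException {
        // Collect the A values for this key from the first job's part file.
        // A real implementation would index or cache this file rather than
        // rescanning it once per key.
        List<String> aValues = new ArrayList<String>();
        SequenceFile.Reader reader =
            new SequenceFile.Reader(FileSystem.get(conf), firstJobPart, conf);
        try {
            IntWritable aKey = new IntWritable();
            Text aValue = new Text();
            while (reader.next(aKey, aValue)) {
                if (aKey.get() == key.get()) {
                    aValues.add(aValue.toString());
                }
            }
        } finally {
            reader.close();
        }
        // Emit the cross product of the A values with the B values for this key.
        for (Text b : bValues) {
            for (String a : aValues) {
                context.write(new Text(a), b);
            }
        }
    }
}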
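On the partitioning question: my reading of the default HashPartitioner (org.apache.hadoop.mapreduce.lib.partition.HashPartitioner) is that the partition is a pure function of the key and the reducer count:

import org.apache.hadoop.mapreduce.Partitioner;

public class HashPartitioner<K, V> extends Partitioner<K, V> {
  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

If that is right, then with identical reducer counts the same key should always land in the same partition number, and my understanding is that a restarted or speculative attempt of reduce task N still produces part-r-0000N, but I would like confirmation.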
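On the cluster-size question, the closest thing I have found is the old mapred ClusterStatus API; a minimal probe, assuming a 0.20-era cluster, would be something like:

import java.io.IOException;
import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ClusterSize {
    public static void main(String[] args) throws IOException {
        // Ask the JobTracker for the current cluster capacity.
        JobClient client = new JobClient(new JobConf());
        ClusterStatus status = client.getClusterStatus();
        System.out.println("task trackers:    " + status.getTaskTrackers());
        System.out.println("max map tasks:    " + status.getMaxMapTasks());
        System.out.println("max reduce tasks: " + status.getMaxReduceTasks());
    }
}

Is there a better way to get at this?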
--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com