I'm trying to do a map-side join using CompositeInputFormat. I read in the book "Hadoop: The Definitive Guide" that I must meet certain conditions: "Each input dataset must be divided into the same number of partitions, and it must be sorted by the same key (the join key) in each source. All the records for a particular key must reside in the same partition. This may sound like a strict requirement (and it is), but it actually fits the description of the output of a MapReduce job."
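To make sure I understand the requirement, here is a minimal sketch in plain Java (no Hadoop dependency) of what I think the two inputs need to look like. The class, method, and dataset names are hypothetical and only for illustration. The partition formula is the one Hadoop's default HashPartitioner uses, as far as I know; whether my real data can be laid out this way is exactly what I'm unsure about.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class PartitionSketch {

    // Same formula as Hadoop's default HashPartitioner:
    // a given key always maps to the same partition number.
    public static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Splits {key, value} records into numPartitions buckets.
    // Each bucket is a TreeMap, so it is sorted by the join key.
    public static List<TreeMap<String, List<String>>> partition(
            String[][] records, int numPartitions) {
        List<TreeMap<String, List<String>>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new TreeMap<>());
        }
        for (String[] record : records) {
            parts.get(partitionFor(record[0], numPartitions))
                 .computeIfAbsent(record[0], k -> new ArrayList<>())
                 .add(record[1]);
        }
        return parts;
    }

    public static void main(String[] args) {
        // Hypothetical datasets, both keyed by user id (the join key).
        String[][] users = {{"u2", "bob"}, {"u1", "alice"}, {"u3", "carol"}};
        String[][] orders = {{"u1", "book"}, {"u3", "pen"}, {"u1", "lamp"}, {"u2", "desk"}};
        int numPartitions = 2; // must be identical for both inputs

        List<TreeMap<String, List<String>>> left = partition(users, numPartitions);
        List<TreeMap<String, List<String>>> right = partition(orders, numPartitions);

        // Partition i of one input only ever needs partition i of the other,
        // which is what would let one map task join the pair locally.
        for (int i = 0; i < numPartitions; i++) {
            for (String key : left.get(i).keySet()) {
                System.out.println("partition " + i + ": " + key + " -> "
                        + left.get(i).get(key) + " x " + right.get(i).get(key));
            }
        }
    }
}
```

In this sketch both inputs use the same partition count and the same partition function, so every key sits in the same partition number on both sides and each partition is sorted by key.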
Do I really need to have all records for a particular key within the same partition? Will Hadoop assign one map task to each partition file? I tried to meet these conditions using ORDER BY in Pig Latin, but it does not put all records with the same key in the same partition (see http://stackoverflow.com/questions/21668974/apache-pig-does-order-by-with-parallel-ensure-consistent-hashing-distribution). How can I meet this condition? Do I need to run a job with an identity mapper and identity reducer just for this? Thanks!