"Yielding equal partitions" means that each input source will offer n
partitions and that, for any given partition 0 <= i < n, the records in
that partition are (1) sorted on the same key and (2) unique to that
partition, i.e. if a key k is in partition i for a given source, k
appears in no other partition from that source, and if any other source
contains k, all of its occurrences appear in partition i of that
source. All the framework really effects is the cartesian product of
the records with matching keys, so yes, that implies equi-joins.

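To make the two conditions concrete, here is a rough Python sketch. It is not the framework's code: the names check_equal_partitions and join_partition are made up for illustration, sources are modeled as in-memory lists of partitions of (key, value) records, only an inner join is shown, and the records are grouped in dictionaries rather than merged as sorted streams.

```python
from itertools import product


def check_equal_partitions(sources):
    """Return True if the sources satisfy conditions (1) and (2).

    `sources` is a list of sources; each source is a list of n partitions;
    each partition is a list of (key, value) records.
    """
    n = len(sources[0])
    if any(len(src) != n for src in sources):
        return False  # every source must offer the same n partitions
    home = {}  # key -> index of the only partition allowed to hold it
    for src in sources:
        for i, part in enumerate(src):
            keys = [k for k, _ in part]
            if keys != sorted(keys):
                return False  # (1) records in a partition are sorted by key
            for k in keys:
                if home.setdefault(k, i) != i:
                    return False  # (2) a key lives at one partition index only
    return True


def join_partition(parts):
    """Inner-join partition i of every source.

    For each key present in all sources, emit the cartesian product of the
    values recorded for that key.
    """
    grouped = []
    for part in parts:
        by_key = {}
        for k, v in part:
            by_key.setdefault(k, []).append(v)
        grouped.append(by_key)
    common = set(grouped[0]).intersection(*grouped[1:])
    joined = []
    for k in sorted(common):
        for combo in product(*(g[k] for g in grouped)):
            joined.append((k, combo))
    return joined


# Two sources with n = 2 partitions each.
source_a = [[("a", 1), ("a", 2)], [("b", 3)]]
source_b = [[("a", "x")], [("b", "y"), ("c", "z")]]

ok = check_equal_partitions([source_a, source_b])
joined_0 = join_partition([source_a[0], source_b[0]])
joined_1 = join_partition([source_a[1], source_b[1]])
```

Because of (2), partition i of one source only ever needs to be compared with partition i of the others, which is why the join can be done before the map.
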
It's a fairly strict requirement. Satisfying it is less onerous if one
is joining the output of several m/r jobs that each use the same keys
and partitioner and the same number of reduces, and if no output file
(part-xxxxx) of any job is splittable. In this case, n is equal to the
number of output files from each job (the number of reduces), (1) is
satisfied if the reduce emits records in the same order it receives
them (i.e. no new keys, no records out of order), and (2) is
guaranteed by the partitioner and (1).

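A small simulation of that case, again only a sketch under stated assumptions: partition and run_job are hypothetical helpers, a CRC32 hash modulo the number of reduces stands in for whatever partitioner the jobs share, and each "part file" is just a sorted list.

```python
import zlib


def partition(key, num_reduces):
    # Deterministic stand-in for a shared hash partitioner (an assumption).
    return zlib.crc32(key.encode("utf-8")) % num_reduces


def run_job(records, num_reduces):
    """Simulate one job's output: one key-sorted part file per reduce."""
    parts = [[] for _ in range(num_reduces)]
    for k, v in records:
        parts[partition(k, num_reduces)].append((k, v))
    # An identity-style reduce emits records in the sorted order it sees them.
    return [sorted(p, key=lambda kv: kv[0]) for p in parts]


def key_locations(job_output):
    """Map each key to the index of the part file that holds it."""
    return {k: i for i, part in enumerate(job_output) for k, _ in part}


# Two independent jobs, same partitioner, same number of reduces (n = 3).
job_a = run_job([("k3", 1), ("k1", 2), ("k2", 3), ("k1", 4)], 3)
job_b = run_job([("k2", "x"), ("k9", "y"), ("k1", "z")], 3)

loc_a = key_locations(job_a)
loc_b = key_locations(job_b)

# Every key the two outputs share sits at the same part index in both.
same_index = all(loc_a[k] == loc_b[k] for k in loc_a.keys() & loc_b.keys())
```

If either job used a different number of reduces or a different partitioner, the same key could land at different indices and condition (2) would no longer hold.
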
An InputFormat capable of parsing metadata about each source to
generate partitions from the set of input sources would be ideal, but
I can point to no existing implementation.

-C

On Jul 14, 2008, at 9:20 AM, Kevin wrote:
Hi,
I find limited information about this package, which looks like it
could do an "equi?" join. "Given a set of sorted datasets keyed with
the same class and yielding equal partitions, it is possible to effect
a join of those datasets prior to the map." What does "yielding equal
partitions" mean?
Thank you.
-Kevin