Hello all,
I need to do some calculations that have to merge two very large data
sets (basically, calculating a variance).
One set contains the "means" and the second contains objects, each tied
to one of those means.
Normally I would send the set of means using the distributed cache, but
the set has become too large to keep in memory and it is going to grow
in the future.
I would like to join the two data files so that each mapper gets the
entries of both files with the same keys. I have seen that there is a
CompositeInputFormat, but I could not find any real documentation on it.
Can anyone enlighten me on whether it would be useful and how it works?
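
To make concrete what I am after, here is a minimal sketch in plain Python
(not Hadoop code, and not a claim about how CompositeInputFormat behaves).
It assumes both inputs are already sorted by key and that the means input
has at most one entry per key. The names merge_join and variance_per_key
are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(means, objects):
    """Yield (key, mean, values) for keys present in both key-sorted inputs.

    means:   iterable of (key, mean) pairs, sorted by key, one pair per key
    objects: iterable of (key, value) pairs, sorted by key
    """
    means_iter = iter(means)
    current = next(means_iter, None)
    for key, group in groupby(objects, key=itemgetter(0)):
        # Advance the means stream until it catches up with this key.
        while current is not None and current[0] < key:
            current = next(means_iter, None)
        if current is None:
            break  # no means left, so no further matches are possible
        if current[0] == key:
            yield key, current[1], [value for _, value in group]


def variance_per_key(means, objects):
    """Population variance of each key's values around its supplied mean."""
    result = {}
    for key, mean, values in merge_join(means, objects):
        result[key] = sum((v - mean) ** 2 for v in values) / len(values)
    return result
```

The point of the sketch is that neither input has to be held in memory as a
whole: both are streamed in key order, and only the values of one key are
buffered at a time. Keys that appear in only one of the two inputs are
skipped.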
Cheers,
Christian