I need to do some calculations that have to merge two very large
data sets (basically to calculate variance).
One set contains "means" and the other contains objects, each tied
to one of those means.
Normally I would ship the set of means via the distributed cache,
but that set has become too large to keep in memory, and it is
going to keep growing in the future.
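For concreteness, the per-key computation being described might look like this. This is a minimal sketch, assuming means arrive as a key-to-mean mapping and observations as (key, value) pairs; the names and shapes are illustrative, not from the original post:

```python
def variance_per_key(means, observations):
    """means: {key: mean}; observations: iterable of (key, value) pairs.

    Returns {key: variance}, using the precomputed mean for each key.
    """
    sums = {}    # key -> running sum of squared deviations
    counts = {}  # key -> number of observations seen for that key
    for key, value in observations:
        dev = value - means[key]
        sums[key] = sums.get(key, 0.0) + dev * dev
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}
```

The point of the question is that `means` no longer fits in memory on one node, so this lookup has to become a join instead of a dictionary access.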
You might want to check out Cascading (http://www.cascading.org),
an API for data processing on Hadoop - it supports SQL-style joins
(which sounds like what you want) via its CoGroup pipe.
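Under the hood, CoGroup performs what is essentially a reduce-side join: both streams are keyed on the join field, and all records sharing a key arrive together in the reducer. A rough illustration of that grouping, as plain Python rather than the Cascading API (field names here are illustrative assumptions):

```python
from collections import defaultdict

def reduce_side_join(means, observations):
    """Simulate a reduce-side (CoGroup-style) join on a shared key.

    means: iterable of (key, mean); observations: iterable of (key, value).
    Yields (key, mean, value) for every matching pair, mimicking how
    co-grouped records from both streams meet in a single reducer call.
    """
    grouped = defaultdict(lambda: ([], []))  # key -> (means, values)
    for key, m in means:
        grouped[key][0].append(m)
    for key, v in observations:
        grouped[key][1].append(v)
    for key, (ms, vs) in grouped.items():
        for m in ms:
            for v in vs:
                yield (key, m, v)
```

Because the grouping happens in the shuffle rather than in memory, neither side of the join has to fit on a single node, which is what makes this approach scale past the distributed-cache limit.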
-- Ken
--
Ken Krugler
+1 530-210-6378