I need to do some calculations that have to merge two very large datasets (basically to compute a variance). One set contains "means" and the other contains objects, each tied to one of those means.

Normally I would ship the set of means via the distributed cache, but the set has become too large to keep in memory, and it will keep growing.

You might want to check out Cascading (http://www.cascading.org), an API for data processing on Hadoop. It supports SQL-style joins (which sounds like what you want) via its CoGroup pipe.
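To make the idea concrete, here is a minimal sketch (in plain Python, not Cascading) of what a CoGroup-style reduce-side join buys you for this problem: both datasets are keyed by a shared id, the framework groups the two sides by that key, and each mean only has to meet the objects that share its key, so the full set of means never sits in one task's memory. The field name `mean_id` and the helper `cogroup_variance` are illustrative assumptions, not Cascading API.

```python
from collections import defaultdict

# (mean_id, mean) pairs -- one side of the join
means = [("a", 2.0), ("b", 5.0)]
# (mean_id, value) pairs -- the objects tied to each mean
objects = [("a", 1.0), ("a", 3.0), ("b", 5.0), ("b", 7.0)]

def cogroup_variance(means, objects):
    """Join objects to their mean by key, then average the
    squared deviations per key -- the reduce-side join pattern."""
    # Group one side by key, as the shuffle/CoGroup step would.
    grouped = defaultdict(list)
    for key, value in objects:
        grouped[key].append(value)
    # For each mean, visit only the objects sharing its key.
    variances = {}
    for key, mean in means:
        values = grouped.get(key, [])
        if values:
            variances[key] = sum((v - mean) ** 2 for v in values) / len(values)
    return variances

print(cogroup_variance(means, objects))  # → {'a': 1.0, 'b': 2.0}
```

In the real job the grouping is done by the shuffle rather than an in-memory dict, which is exactly why this scales past what the distributed cache can hold.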

-- Ken
--
Ken Krugler
+1 530-210-6378
