I need to do some calculations that have to merge two very large
data sets (basically to calculate variance).
One set contains "means" and the other contains objects, each tied
to one of those means.
Normally I would ship the set of means via the distributed cache,
but that set has become too large to keep in memory, and it is
going to keep growing in the future.
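For concreteness, the per-key computation being described might look like this. This is a minimal sketch, assuming means arrive as a key-to-mean mapping and observations as (key, value) pairs; the names and shapes are illustrative, not from the original post:

```python
def variance_per_key(means, observations):
    """means: {key: mean}; observations: iterable of (key, value) pairs.

    Returns {key: variance}, using the precomputed mean for each key.
    """
    sums = {}    # key -> running sum of squared deviations
    counts = {}  # key -> number of observations seen for that key
    for key, value in observations:
        dev = value - means[key]
        sums[key] = sums.get(key, 0.0) + dev * dev
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}
```

The point of the question is that `means` no longer fits in memory on one node, so this lookup has to become a join instead of a dictionary access.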
You might want to check out Cascading (http://www.cascading.org),
an API for data processing on Hadoop - it supports SQL-style joins
(which sounds like what you want) via its CoGroup pipe.
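Under the hood, CoGroup performs what is essentially a reduce-side join: both streams are keyed on the join field, and all records sharing a key arrive together in the reducer. A rough illustration of that grouping, as plain Python rather than the Cascading API (field names here are illustrative assumptions):

```python
from collections import defaultdict

def reduce_side_join(means, observations):
    """Simulate a reduce-side (CoGroup-style) join on a shared key.

    means: iterable of (key, mean); observations: iterable of (key, value).
    Yields (key, mean, value) for every matching pair, mimicking how
    co-grouped records from both streams meet in a single reducer call.
    """
    grouped = defaultdict(lambda: ([], []))  # key -> (means, values)
    for key, m in means:
        grouped[key][0].append(m)
    for key, v in observations:
        grouped[key][1].append(v)
    for key, (ms, vs) in grouped.items():
        for m in ms:
            for v in vs:
                yield (key, m, v)
```

Because the grouping happens in the shuffle rather than in memory, neither side of the join has to fit on a single node, which is what makes this approach scale past the distributed-cache limit.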
-- Ken
--
Ken Krugler
+1 530-210-6378