If you have scaling problems, check out the Mahout project. It is all about distributed, scalable linear algebra and more. http://mahout.apache.org
Lance

On Wed, Jun 22, 2011 at 5:13 PM, Jason <urg...@gmail.com> wrote:
> I remember I had a similar problem.
> The way I approached it was by partitioning one of the data sets. At a high
> level, these are the steps:
>
> Suppose you decide to partition set A.
>
> Each partition represents a subset/range of the A keys and must be small
> enough for its records to fit in memory.
>
> Each partition gets sent to a separate reducer by the mapper and partitioner
> logic.
>
> The second data set, B, is then *duplicated* for each of the reducers, again
> using some trivial logic in the mapper and partitioner.
>
> This assumes that the reducers can process records from both the A and B sets.
> Also, all records from A precede those from B, which is trivially done by the
> sort comparator.
>
> When a reducer receives a record from set A, it stores it in memory.
> When a record from set B arrives, the cross product is computed with all A
> records already in memory and the results are emitted.
>
> The job should scale in space as long as you have enough reducers assigned,
> and it will scale in time with more reducer machines.
>
>
> Sent from my iPhone
>
> On Jun 22, 2011, at 3:16 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>
>> Assume I have two data sources, A and B.
>> Assume I have an input format and can generate key/values for both A and B.
>> I want an algorithm which will generate the cross product of all values in A
>> having the key K and all values in B having the key K.
>> Currently I use a mapper to generate key/values for A and have the reducer
>> get all values in B with key K and hold them in memory.
>> It works but might not scale.
>>
>> Any bright ideas?
>>
>> --
>> Steven M. Lewis PhD
>> 4221 105th Ave NE
>> Kirkland, WA 98033
>> 206-384-1340 (cell)
>> Skype lordjoe_com
>>
>>
>

--
Lance Norskog
goks...@gmail.com
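
For concreteness, below is a minimal, untested sketch (against the newer org.apache.hadoop.mapreduce API) of the per-key variant of what Jason describes: tag records from A and B, partition and group on the join key only, sort so A values reach the reducer before B values, buffer A in memory, and stream B against it. The class names (CrossProductSketch, TagMapper, etc.) and the input layout (text lines of the form src<TAB>key<TAB>value, where src is "A" or "B") are assumptions made for illustration, not anything from the thread. If the A values for a single key are themselves too large for memory, you would still need Jason's refinement of splitting A into sub-partitions and replicating B to each of them.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CrossProductSketch {

  // Mapper: tag each record so that, for the same join key, A sorts before B.
  // Composite map key is "joinKey<TAB>0" for A records and "joinKey<TAB>1" for B.
  public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 3);
      if (parts.length < 3) return;                    // skip malformed lines
      String tag = parts[0].equals("A") ? "0" : "1";   // "0" sorts before "1"
      ctx.write(new Text(parts[1] + "\t" + tag), new Text(tag + "\t" + parts[2]));
    }
  }

  // Partitioner: route by join key only, so A and B records with the same key
  // reach the same reducer even though their composite keys differ.
  public static class JoinKeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
      String joinKey = key.toString().split("\t", 2)[0];
      return (joinKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Grouping comparator: group by join key only, so a single reduce() call
  // sees all A values first and then all B values for that key.
  public static class JoinKeyGroupingComparator extends WritableComparator {
    protected JoinKeyGroupingComparator() { super(Text.class, true); }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      String ka = a.toString().split("\t", 2)[0];
      String kb = b.toString().split("\t", 2)[0];
      return ka.compareTo(kb);
    }
  }

  // Reducer: buffer A values in memory; stream B values and emit one output
  // record per (A value, B value) pair, i.e. the cross product for this key.
  public static class CrossReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String joinKey = key.toString().split("\t", 2)[0];
      List<String> aValues = new ArrayList<String>();
      for (Text v : values) {
        String[] parts = v.toString().split("\t", 2);
        if (parts[0].equals("0")) {
          aValues.add(parts[1]);                        // A arrives first: buffer it
        } else {
          for (String a : aValues) {                    // B arrives after: cross with A
            ctx.write(new Text(joinKey), new Text(a + "\t" + parts[1]));
          }
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cross-product-sketch");
    job.setJarByClass(CrossProductSketch.class);
    job.setMapperClass(TagMapper.class);
    job.setReducerClass(CrossReducer.class);
    job.setPartitionerClass(JoinKeyPartitioner.class);
    job.setGroupingComparatorClass(JoinKeyGroupingComparator.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With this layout, memory use per reducer is bounded by the largest set of A values for a single key, and adding reducers spreads the keys across more machines, which matches the scaling behavior Jason describes.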