The approach you suggest is similar to what I am currently doing but it requires you to size the partitions to the memory available on the reducer. This is a non-trivial task and is not necessarily guaranteed to scale. It is true that the simplest approach is to break one of the sets into sufficiently small partitions to hold a partition in memory and then generate the Cartesian product but it is a hack and makes assumptions about partition size. One elegant solution would involve an ability to restart one of the input splitters and replay the input data from set A several times until the mapper had generated all sets of the form <key,(ai,bj>
On Wed, Jun 22, 2011 at 5:13 PM, Jason <urg...@gmail.com> wrote: > I remember I had a similar problem. > The way I approached it was by partitioning one of the data sets. At high > level these are the steps: > > Suppose you decide to partition set A. > > Each partition represents a subset/range of the A keys and must be small > enough to fit records in memory. > > Each partition gets sent to a separate reducer by the mapper and > partitioner logic. > > The second data set B then is *duplicated* for each of the reducers again > using some trivial logic in mapper and partitioner. > > This assumes that the reducers can process record from both A and B sets. > Also all records from A preceed ones from B which is trivially done by sort > comparator. > > When a reducer receives a record from A set, it stores it in memory. > When a record from set B arrives, the cross product is computed with all A > records already in memory and results are emitted. > > The job should scale in space as long as you have enough reducers assigned > and will scale in time with more reducer machines. > > > Sent from my iPhone > > On Jun 22, 2011, at 3:16 PM, Steve Lewis <lordjoe2...@gmail.com> wrote: > > > Assume I have two data sources A and B > > Assume I have an input format and can generate key values for both A and > B > > I want an algorithm which will generate the cross product of all values > in A having the key K and all values in B having the > > key K. > > Currently I use a mapper to generate key values for A and have the > reducer get all values in B with key K and hold them in memory. > > It works but might not scale. > > > > Any bright ideas? > > > > -- > > Steven M. Lewis PhD > > 4221 105th Ave NE > > Kirkland, WA 98033 > > 206-384-1340 (cell) > > Skype lordjoe_com > > > > > -- Steven M. Lewis PhD 4221 105th Ave NE Kirkland, WA 98033 206-384-1340 (cell) Skype lordjoe_com