> One elegant solution would involve an ability to restart one of the input > splitters and replay the input data from set A several times until the mapper > had generated all sets of the form <key,(ai,bj>
By replaying A several times, did you actually mean reading A set B times? Cause in this case you may end up with a lot of IO use depending on size of your sets and a size of a single record. With partitioning approach, IO reads are naturally reduced by choosing partition size larger. Sent from my iPhone On Jun 23, 2011, at 10:07 AM, Steve Lewis <lordjoe2...@gmail.com> wrote: > The approach you suggest is similar to what I am currently doing but it > requires you to size the partitions to the memory available on the reducer. > This is a non-trivial task and is not necessarily guaranteed to scale. It is > true that the simplest approach is to break one of the sets into sufficiently > small partitions to hold a partition in memory and then generate the > Cartesian product but it is > a hack and makes assumptions about partition size. > One elegant solution would involve an ability to restart one of the input > splitters and replay the input data from set A several times until the mapper > had generated all sets of the form <key,(ai,bj> > > > On Wed, Jun 22, 2011 at 5:13 PM, Jason <urg...@gmail.com> wrote: > I remember I had a similar problem. > The way I approached it was by partitioning one of the data sets. At high > level these are the steps: > > Suppose you decide to partition set A. > > Each partition represents a subset/range of the A keys and must be small > enough to fit records in memory. > > Each partition gets sent to a separate reducer by the mapper and partitioner > logic. > > The second data set B then is *duplicated* for each of the reducers again > using some trivial logic in mapper and partitioner. > > This assumes that the reducers can process record from both A and B sets. > Also all records from A preceed ones from B which is trivially done by sort > comparator. > > When a reducer receives a record from A set, it stores it in memory. > When a record from set B arrives, the cross product is computed with all A > records already in memory and results are emitted. > > The job should scale in space as long as you have enough reducers assigned > and will scale in time with more reducer machines. > > > Sent from my iPhone > > On Jun 22, 2011, at 3:16 PM, Steve Lewis <lordjoe2...@gmail.com> wrote: > > > Assume I have two data sources A and B > > Assume I have an input format and can generate key values for both A and B > > I want an algorithm which will generate the cross product of all values in > > A having the key K and all values in B having the > > key K. > > Currently I use a mapper to generate key values for A and have the reducer > > get all values in B with key K and hold them in memory. > > It works but might not scale. > > > > Any bright ideas? > > > > -- > > Steven M. Lewis PhD > > 4221 105th Ave NE > > Kirkland, WA 98033 > > 206-384-1340 (cell) > > Skype lordjoe_com > > > > > > > > -- > Steven M. Lewis PhD > 4221 105th Ave NE > Kirkland, WA 98033 > 206-384-1340 (cell) > Skype lordjoe_com > >