> One elegant solution would involve an ability to restart one of the input 
> splitters and replay the input data from set A several times until the mapper 
> had generated all pairs of the form <key, (ai, bj)>.

By replaying A several times, did you actually mean reading set A once per 
record (or block) of B? Because in that case you may end up with a lot of I/O, 
depending on the size of your sets and the size of a single record. With the 
partitioning approach, I/O reads are naturally reduced by choosing a larger 
partition size.

Sent from my iPhone

On Jun 23, 2011, at 10:07 AM, Steve Lewis <lordjoe2...@gmail.com> wrote:

> The approach you suggest is similar to what I am currently doing, but it 
> requires you to size the partitions to fit the memory available on the 
> reducer. That is a non-trivial task and is not guaranteed to scale. It is 
> true that the simplest approach is to break one of the sets into partitions 
> small enough to hold in memory and then generate the Cartesian product, but 
> that is a hack and makes assumptions about partition size. 
> One elegant solution would involve an ability to restart one of the input 
> splitters and replay the input data from set A several times until the mapper 
> had generated all pairs of the form <key, (ai, bj)>.
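[The replay idea above amounts to a block nested-loop join: hold a memory-sized block of B, replay all of A against it, repeat. A minimal Python sketch, where `open_a_reader` simulates a restartable input splitter; all names and the `block_size` value are illustrative:]

```python
def replay_cross_product(open_a_reader, b_records, block_size=2):
    """Cross product per key by replaying set A once per block of B.

    open_a_reader() must return a fresh iterator over A's (key, value)
    records each time it is called -- the 'restartable splitter'."""
    out = []

    def flush(block):
        # Index the current block of B by key, then replay all of A.
        index = {}
        for key, value in block:
            index.setdefault(key, []).append(value)
        for key, a_value in open_a_reader():
            for b_value in index.get(key, []):
                out.append((key, (a_value, b_value)))

    block = []
    for record in b_records:
        block.append(record)
        if len(block) == block_size:
            flush(block)
            block = []
    if block:
        flush(block)
    return out
```

[Note that A is read ceil(|B| / block_size) times, which is exactly the I/O concern raised in the reply: a block size of 1 degenerates to reading A once per record of B.]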
> 
> 
> On Wed, Jun 22, 2011 at 5:13 PM, Jason <urg...@gmail.com> wrote:
> I remember I had a similar problem.
> The way I approached it was by partitioning one of the data sets. At high 
> level these are the steps:
> 
> Suppose you decide to partition set A.
> 
> Each partition represents a subset/range of the A keys and must be small 
> enough to fit records in memory.
> 
> Each partition gets sent to a separate reducer by the mapper and partitioner 
> logic.
> 
> The second data set B then is *duplicated* for each of the reducers again 
> using some trivial logic in mapper and partitioner.
> 
> This assumes that the reducers can process records from both the A and B 
> sets. It also requires that all records from A precede those from B, which 
> is easily arranged with a sort comparator.
> 
> When a reducer receives a record from A set, it stores it in memory.
> When a record from set B arrives, the cross product is computed with all A 
> records already in memory and results are emitted.
> 
> The job should scale in space as long as you have enough reducers assigned 
> and will scale in time with more reducer machines.
> 
> 
> Sent from my iPhone
> 
> On Jun 22, 2011, at 3:16 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
> 
> > Assume I have two data sources A and B
> > Assume I have an input format and can generate key values for both A and B
> > I want an algorithm which will generate the cross product of all values in 
> > A having the key K and all values in B having the
> > key K.
> > Currently I use a mapper to generate key values for A and have the reducer 
> > get all values in B with key K and hold them in memory.
> > It works but might not scale.
> >
> > Any bright ideas?
> >
> > --
> > Steven M. Lewis PhD
> > 4221 105th Ave NE
> > Kirkland, WA 98033
> > 206-384-1340 (cell)
> > Skype lordjoe_com
> >
> >
> 
> 
> 
> -- 
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
> 
> 
