If you have scaling problems, check out the Mahout project: it is all
about distributed, scalable linear algebra and more.
http://mahout.apache.org

Lance

On Wed, Jun 22, 2011 at 5:13 PM, Jason <urg...@gmail.com> wrote:
> I remember I had a similar problem.
> The way I approached it was by partitioning one of the data sets. At a high 
> level, these are the steps:
>
> Suppose you decide to partition set A.
>
> Each partition represents a subset/range of the A keys and must be small 
> enough that its records fit in memory.
>
> Each partition gets sent to a separate reducer by the mapper and partitioner 
> logic.
>
> The second data set, B, is then *duplicated* for each of the reducers, again 
> using some trivial logic in the mapper and partitioner.
>
> This assumes that the reducers can process records from both the A and B sets. 
> Also, all records from A must precede those from B, which is easily arranged 
> with a sort comparator.
>
> When a reducer receives a record from the A set, it stores it in memory.
> When a record from set B arrives, the cross product is computed with all the A 
> records already in memory, and the results are emitted (see the sketch below).
>
> The job should scale in space as long as you have enough reducers assigned, 
> and in time as you add more reducer machines.
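>
> Here is a rough sketch of that pattern in the Hadoop mapreduce API. The 
> class names, the 0/1 tag convention, and the tab-separated key parsing are 
> just my illustration, not anything you have to follow:
>
>   import java.io.*;
>   import java.util.*;
>   import org.apache.hadoop.io.LongWritable;
>   import org.apache.hadoop.io.Text;
>   import org.apache.hadoop.io.WritableComparable;
>   import org.apache.hadoop.mapreduce.Mapper;
>   import org.apache.hadoop.mapreduce.Partitioner;
>   import org.apache.hadoop.mapreduce.Reducer;
>
>   public class CrossProductJoin {
>
>     // Composite key: destination partition, source tag (0 = A, 1 = B),
>     // and the join key. Sorting on the tag means every A record reaches
>     // the reducer before any B record.
>     public static class TaggedKey implements WritableComparable<TaggedKey> {
>       int partition;
>       byte tag;
>       Text joinKey = new Text();
>
>       public void write(DataOutput out) throws IOException {
>         out.writeInt(partition); out.writeByte(tag); joinKey.write(out);
>       }
>       public void readFields(DataInput in) throws IOException {
>         partition = in.readInt(); tag = in.readByte(); joinKey.readFields(in);
>       }
>       public int compareTo(TaggedKey o) {
>         if (partition != o.partition) return partition < o.partition ? -1 : 1;
>         if (tag != o.tag) return tag < o.tag ? -1 : 1;
>         return joinKey.compareTo(o.joinKey);
>       }
>     }
>
>     // Assumption: the join key is the first tab-separated field of the line.
>     static String extractKey(Text line) {
>       return line.toString().split("\t", 2)[0];
>     }
>
>     // A-side mapper: each record goes to exactly one partition, chosen by key.
>     public static class AMapper extends Mapper<LongWritable, Text, TaggedKey, Text> {
>       protected void map(LongWritable off, Text line, Context ctx)
>           throws IOException, InterruptedException {
>         TaggedKey key = new TaggedKey();
>         key.tag = 0;
>         key.joinKey.set(extractKey(line));
>         key.partition = (key.joinKey.hashCode() & Integer.MAX_VALUE)
>             % ctx.getNumReduceTasks();
>         ctx.write(key, line);
>       }
>     }
>
>     // B-side mapper: duplicate each record once per reducer.
>     public static class BMapper extends Mapper<LongWritable, Text, TaggedKey, Text> {
>       protected void map(LongWritable off, Text line, Context ctx)
>           throws IOException, InterruptedException {
>         TaggedKey key = new TaggedKey();
>         key.tag = 1;
>         key.joinKey.set(extractKey(line));
>         for (int p = 0; p < ctx.getNumReduceTasks(); p++) {
>           key.partition = p;
>           ctx.write(key, line);
>         }
>       }
>     }
>
>     // Trivial partitioner: honor the partition id baked into the key.
>     public static class IdPartitioner extends Partitioner<TaggedKey, Text> {
>       public int getPartition(TaggedKey key, Text value, int numPartitions) {
>         return key.partition;
>       }
>     }
>
>     // Reducer: buffer A records by join key (they all sort first), then
>     // stream B records and emit the per-key cross product.
>     public static class CrossReducer extends Reducer<TaggedKey, Text, Text, Text> {
>       private final Map<String, List<String>> aByKey =
>           new HashMap<String, List<String>>();
>
>       protected void reduce(TaggedKey key, Iterable<Text> values, Context ctx)
>           throws IOException, InterruptedException {
>         for (Text v : values) {
>           if (key.tag == 0) {                     // A record: buffer it
>             List<String> as = aByKey.get(key.joinKey.toString());
>             if (as == null) {
>               as = new ArrayList<String>();
>               aByKey.put(key.joinKey.toString(), as);
>             }
>             as.add(v.toString());
>           } else {                                // B record: emit pairs
>             List<String> as = aByKey.get(key.joinKey.toString());
>             if (as == null) continue;
>             for (String a : as) ctx.write(new Text(a), new Text(v.toString()));
>           }
>         }
>       }
>     }
>   }
>
> Wire it up with MultipleInputs.addInputPath(...) so A and B each get their 
> own mapper, and set job.setPartitionerClass(IdPartitioner.class). The 
> reducer's memory footprint is bounded by its share of A, which is the 
> whole point of the partitioning.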
>
>
> Sent from my iPhone
>
> On Jun 22, 2011, at 3:16 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>
>> Assume I have two data sources, A and B.
>> Assume I have an input format and can generate key/value pairs for both A and B.
>> I want an algorithm that will generate the cross product of all values in A 
>> having the key K and all values in B having the key K.
>> Currently I use a mapper to generate key/value pairs for A and have the reducer 
>> get all values in B with key K and hold them in memory.
>> It works but might not scale.
>>
>> Any bright ideas?
>>
>> --
>> Steven M. Lewis PhD
>> 4221 105th Ave NE
>> Kirkland, WA 98033
>> 206-384-1340 (cell)
>> Skype lordjoe_com
>>
>>
>



-- 
Lance Norskog
goks...@gmail.com
