Re: [google-appengine] data Join using MapReduce/pipeline api

Robert Kluin Wed, 25 May 2011 09:11:08 -0700

On Wed, May 25, 2011 at 01:07, Bing Wei <[email protected]> wrote:
> Thanks, Brett.
> I am still confused about the join process. I append the results of two maps
> together, but get exceptions when I run the Shuffler.
> The result of a map is a list, but the two map results have different
> lengths. Say the first map result has length 3 and the second has
> length 7. They look something like this:
> [ [a1, b1, c1], [a2, b2, c2], [a3, b3, c3] ]
> [ [d1, e1, f1, g1, h1, j1, k1], [d2, e2, f2, g2, h2, j2, k2], ... [d7,
> ...,k7] ]
> The append result would be:
> [ [a1, b1, c1],..., [a3, b3, c3], [d1, e1, f1, g1, h1, j1, k1], ... [d7,
> ...,k7] ]
> Then I get the following exception:
>   File "/root/pigappscale/appEngine/actionlogtests/mapreduce/mapreduce_pipeline.py", line 113, in run
>     shuffled_shards[i].append(filename)
> IndexError: list index out of range
> The problem seems to result from the different sizes of the map results.
> By ReferenceProperty, I mean that one entity can use it to reference
> another entity, which implements a join of the two. For example, we have a
> Sales entity and a Customer entity. The Sales entity has a user
> field, which is a ReferenceProperty that references the Customer entity.
> Then we don't need to join the two 'tables'; we can use sales.customer to
> get the equivalent of a join.


This will be extremely inefficient.  Even if 'sales' and 'customers'
are one-to-one, you'll be making a separate fetch to get each customer
entity.
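
To make the cost concrete, here is a minimal sketch that only counts
round trips. FakeDatastore, get_multi, and the sample data are
hypothetical stand-ins, not the App Engine API; the point is just that
dereferencing one reference per sale costs one fetch per sale, while
collecting the keys first allows a single batch fetch.

```python
# Hypothetical in-memory stand-in for the datastore. It is NOT the App
# Engine API; it only counts how many round trips a join strategy makes.
class FakeDatastore(object):
    def __init__(self, entities):
        self._entities = dict(entities)
        self.round_trips = 0

    def get(self, key):
        """Fetch a single entity: one round trip per call."""
        self.round_trips += 1
        return self._entities[key]

    def get_multi(self, keys):
        """Fetch a list of entities in one batch: one round trip total."""
        self.round_trips += 1
        return [self._entities[k] for k in keys]


# Illustrative data: three sales, each referencing one customer.
customers = {"c1": {"name": "Ann"}, "c2": {"name": "Bob"}, "c3": {"name": "Cy"}}
sales = [
    {"id": "s1", "customer_key": "c1"},
    {"id": "s2", "customer_key": "c2"},
    {"id": "s3", "customer_key": "c3"},
]


def join_one_by_one(store, sales):
    # Mirrors following a reference on each sale: one fetch per sale.
    return [(s["id"], store.get(s["customer_key"])["name"]) for s in sales]


def join_batched(store, sales):
    # Collect the referenced keys first, then fetch them all at once.
    keys = [s["customer_key"] for s in sales]
    fetched = store.get_multi(keys)
    return [(s["id"], c["name"]) for s, c in zip(sales, fetched)]


slow_store = FakeDatastore(customers)
fast_store = FakeDatastore(customers)
slow_result = join_one_by_one(slow_store, sales)
fast_result = join_batched(fast_store, sales)
```

Both strategies produce the same pairs, but the first makes one round
trip per sale. With the real db API, I believe the batched form would
read the key with Sales.customer.get_value_for_datastore(sale) and then
call db.get() once on the list of keys; treat that as an assumption to
check against the docs, and note that it still does datastore reads
inside the job.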


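For the join itself, the append-then-shuffle idea Brett describes in the
quoted text below can be sketched in plain Python. This is only an
illustration of the concept, not the Shuffle pipeline's actual code or
input format: map_sale, map_customer, shuffle, reduce_join, and the
record fields are all hypothetical names. Each map emits (join key,
tagged record) pairs, so the two outputs can be appended even though
they have different lengths and different record shapes.

```python
from collections import defaultdict


def map_sale(sale):
    # Emit (join key, tagged record); the tag records the source dataset.
    return (sale["customer_id"], ("sale", sale))


def map_customer(customer):
    return (customer["customer_id"], ("customer", customer))


def shuffle(pairs):
    # Group every value under its key, regardless of which map produced it.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, tagged_values):
    # Split the group back out by tag and pair each customer with each sale.
    customers = [v for tag, v in tagged_values if tag == "customer"]
    sales = [v for tag, v in tagged_values if tag == "sale"]
    return [(key, c["name"], s["amount"]) for c in customers for s in sales]


# Illustrative data: the two inputs deliberately have different shapes.
sales = [
    {"customer_id": 1, "amount": 10},
    {"customer_id": 2, "amount": 5},
    {"customer_id": 1, "amount": 7},
]
customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
    {"customer_id": 3, "name": "Cy"},
]

# Append the two map outputs, shuffle them together, then reduce per key.
appended = [map_sale(s) for s in sales] + [map_customer(c) for c in customers]
joined = []
for key, values in sorted(shuffle(appended).items()):
    joined.extend(reduce_join(key, values))
```

This behaves like an inner join: customer 3 has no sales, so it yields
no rows. No datastore fetch happens in the reduce step, because every
record needed for a key arrives in that key's group.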

>
> On Tue, May 24, 2011 at 12:41 PM, Brett Slatkin <[email protected]>
> wrote:
>>
>> On Mon, May 23, 2011 at 2:06 PM, Bing <[email protected]> wrote:
>> >
>> > In the Google I/O talk, the data join is implemented with the Append
>> > method. But the Append method seems only to append lists together. Is
>> > Append just a high-level concept, or is there an implementation?
>> >
>> > Also, a join can be implemented using ReferenceProperty. It is not
>> > necessary to do the map first and then append the sets of map results
>> > together.
>>
>> Append is in here:
>>
>>
>> http://code.google.com/p/appengine-pipeline/source/browse/trunk/src/pipeline/common.py
>>
>> The idea is you would append the inputs together, then run Shuffle on
>> them in combination. Shuffle is what actually does the join. You can
>> find Shuffle here:
>>
>>
>> http://code.google.com/p/appengine-mapreduce/source/browse/trunk/python/src/mapreduce/shuffler.py
>>
>> As for the referenceProperty thing, I'm not sure exactly what you
>> mean, but generally if you have to do any explicit queries in a map or
>> reduce phase you are going to hit scalability problems. The job needs
>> to run at full speed with minimal latency for each mapper or reduce
>> input.
>>
>> -Brett
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Google App Engine" group.
>> To post to this group, send email to [email protected].
>> To unsubscribe from this group, send email to
>> [email protected].
>> For more options, visit this group at
>> http://groups.google.com/group/google-appengine?hl=en.
>>
>
>
>
> --
> Bing
>
> Graduate Student
> Computer Science Department, UCSB :)
>
>
>

