Re: [google-appengine] data Join using MapReduce/pipeline api

Bing Wei Tue, 24 May 2011 22:07:12 -0700

Thanks, Brett.

I am still confused about the join process. I append the results of the two
maps together, but I get an exception when I run the Shuffler.
Each map result is a list, but the two results have different
lengths. Say the first map result has length 3 and the second has
length 7. They look something like this:
[ [a1, b1, c1], [a2, b2, c2], [a3, b3, c3] ]
[ [d1, e1, f1, g1, h1, j1, k1], [d2, e2, f2, g2, h2, j2, k2], ... [d7,
...,k7] ]
The append result would be:
[ [a1, b1, c1],..., [a3, b3, c3], [d1, e1, f1, g1, h1, j1, k1], ... [d7,
...,k7] ]


Then I get the following exception:

  File "/root/pigappscale/appEngine/actionlogtests/mapreduce/mapreduce_pipeline.py", line 113, in run
    shuffled_shards[i].append(filename)
IndexError: list index out of range

It seems the problem results from the different sizes of the map results.
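
For reference, this is how I understand the append-then-shuffle join, written
as a minimal plain-Python sketch. It does not use the real pipeline Append or
the mapreduce Shuffler; append() and shuffle() here are stand-ins I wrote for
illustration, and the (key, value) record shape and the sample data are my own
assumptions. The point is that the two inputs can have different lengths as
long as every record carries the join key.

```python
from collections import defaultdict


def append(*lists):
    """Stand-in for Append: concatenate several lists into one."""
    combined = []
    for items in lists:
        combined.extend(items)
    return combined


def shuffle(pairs):
    """Stand-in for Shuffle: group values by key, keeping input order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


# Hypothetical map outputs, both keyed by customer id.
sales = [("c1", ("sale", 10)), ("c2", ("sale", 25)), ("c1", ("sale", 7))]
customers = [("c1", ("customer", "Ann")), ("c2", ("customer", "Bob"))]

# The join: append both inputs, then group everything by key.
joined = shuffle(append(sales, customers))
```

After this, joined["c1"] holds both sales and the customer record for c1, so a
reducer would see all values for one key together.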

By ReferenceProperty, I mean that one entity can hold a reference to
another entity, which gives the effect of a join between the two. For
example, we have a Sales entity and a Customer entity. The Sales entity has
a customer field, which is a ReferenceProperty referencing the Customer
entity. Then we don't need to join the two 'tables'; we can use
sales.customer to get the equivalent of a join.


On Tue, May 24, 2011 at 12:41 PM, Brett Slatkin
<[email protected]> wrote:

> On Mon, May 23, 2011 at 2:06 PM, Bing <[email protected]> wrote:
> >
> > In the google io talk, data join is implemented by Append method. But
> > it seems the Append method is only to append lists together. Is that
> > Append method just a high-level concept or is there an implementation?
> >
> > Also, join can be implemented by using referenceProperty. It is not
> > necessary to do the map first, and append sets of the map results
> > together.
>
> Append is in here:
>
>
> http://code.google.com/p/appengine-pipeline/source/browse/trunk/src/pipeline/common.py
>
> The idea is you would append the inputs together, then run Shuffle on
> them in combination. Shuffle is what actually does the join. You can
> find Shuffle here:
>
>
> http://code.google.com/p/appengine-mapreduce/source/browse/trunk/python/src/mapreduce/shuffler.py
>
> As for the referenceProperty thing, I'm not sure exactly what you
> mean, but generally if you have to do any explicit queries in a map or
> reduce phase you are going to hit scalability problems. The job needs
> to run at full speed with minimal latency for each mapper or reduce
> input.
>
> -Brett
>
> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.
>
>


-- 
Bing

Graduate Student
Computer Science Department, UCSB :)
