On Wed, May 25, 2011 at 01:07, Bing Wei <[email protected]> wrote:
> Thanks, Brett.
> I am still confused about the join process. I append the results of two maps
> together, but get exceptions when I run the Shuffler.
> The result of a map is a list. But the two map results have different
> length. Let's say the length of the first map result is 3, the length of the
> second map result is 7. They are something like this:
> [ [a1, b1, c1], [a2, b2, c2], [a3, b3, c3] ]
> [ [d1, e1, f1, g1, h1, j1, k1], [d2, e2, f2, g2, h2, j2, k2], ... [d7,
> ...,k7] ]
> The append result would be:
> [ [a1, b1, c1],..., [a3, b3, c3], [d1, e1, f1, g1, h1, j1, k1], ... [d7,
> ...,k7] ]
> Then I get exception as follows:
>   File
> "/root/pigappscale/appEngine/actionlogtests/mapreduce/mapreduce_pipeline.py",
> line 113, in run
>     shuffled_shards[i].append(filename)
> IndexError: list index out of range
> Seems the problem result from different sizes of the map results.
>
> By referenceProperty, I mean we can use it in one entity which reference to
> another entity to implement join of the two entities. For example, we have a
> Sales Entity and a Customer Entity. In the Sales Entity, we have a user
> field, which is a referenceProperty and reference to the Customer Entity.
> Then we don't need to join the two 'tables', but use sales.customer to
> implement the equivalent function as join.

This will be extremely inefficient. Even if 'sales' and 'customers' are
one-to-one, you'll be making a separate fetch to get each customer entity.

>
> On Tue, May 24, 2011 at 12:41 PM, Brett Slatkin <[email protected]>
> wrote:
>>
>> On Mon, May 23, 2011 at 2:06 PM, Bing <[email protected]> wrote:
>> >
>> > In the google io talk, data join is implemented by Append method. But
>> > it seems the Append method is only to append lists together. Is that
>> > Append method just a high-level concept or is there an implementation?
>> >
>> > Also, join can be implemented by using referenceProperty. It is not
>> > necessary to do the map first, and append sets of the map results
>> > together.
>>
>> Append is in here:
>>
>> http://code.google.com/p/appengine-pipeline/source/browse/trunk/src/pipeline/common.py
>>
>> The idea is you would append the inputs together, then run Shuffle on
>> them in combination. Shuffle is what actually does the join. You can
>> find Shuffle here:
>>
>> http://code.google.com/p/appengine-mapreduce/source/browse/trunk/python/src/mapreduce/shuffler.py
>>
>> As for the referenceProperty thing, I'm not sure exactly what you
>> mean, but generally if you have to do any explicit queries in a map or
>> reduce phase you are going to hit scalability problems. The job needs
>> to run at full speed with minimal latency for each mapper or reduce
>> input.
>>
>> -Brett
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Google App Engine" group.
>> To post to this group, send email to [email protected].
>> To unsubscribe from this group, send email to
>> [email protected].
>> For more options, visit this group at
>> http://groups.google.com/group/google-appengine?hl=en.
>>
>
> --
> Bing
>
> Graduate Student
> Computer Science Department, UCSB :)
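
A rough illustration of the per-entity fetch cost mentioned in the reply above. This is plain Python, not the App Engine API: FakeDatastore, get, get_multi and the sample data are all hypothetical stand-ins that only count round trips. The point is just that dereferencing a reference per sale costs one fetch per sale, while collecting the keys first and fetching them together costs one.

```python
# Toy in-memory stand-in for a datastore. FakeDatastore and its methods are
# hypothetical; they only count round trips and are NOT the App Engine API.
class FakeDatastore:
    def __init__(self, entities):
        self._entities = dict(entities)
        self.round_trips = 0

    def get(self, key):
        """Fetch a single entity: one round trip."""
        self.round_trips += 1
        return self._entities[key]

    def get_multi(self, keys):
        """Fetch several entities at once: still one round trip."""
        self.round_trips += 1
        return [self._entities[k] for k in keys]


# Made-up sample data: 100 sales that reference 5 customers by key.
customers = {"c%d" % i: {"name": "customer %d" % i} for i in range(5)}
sales = [{"id": i, "customer_key": "c%d" % (i % 5)} for i in range(100)]


def join_one_by_one(store, sales):
    # Mimics following a reference on every sale: one fetch per sale.
    return [(s["id"], store.get(s["customer_key"])["name"]) for s in sales]


def join_batched(store, sales):
    # Collect the distinct keys first, then fetch them in a single call.
    keys = sorted({s["customer_key"] for s in sales})
    by_key = dict(zip(keys, store.get_multi(keys)))
    return [(s["id"], by_key[s["customer_key"]]["name"]) for s in sales]


slow_store = FakeDatastore(customers)
fast_store = FakeDatastore(customers)
slow_result = join_one_by_one(slow_store, sales)
fast_result = join_batched(fast_store, sales)
print(slow_store.round_trips, fast_store.round_trips)
```

Both functions return the same pairs; only the number of round trips differs. Batching helps, but it is still an explicit fetch inside the job, so Brett's caveat about queries in a map or reduce phase applies to it as well.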
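
For what it's worth, here is a minimal sketch of the "append the map outputs, then shuffle" idea from Brett's message, in plain Python. It is not the appengine-mapreduce Shuffle (which works on files and shards) and it does not try to explain the IndexError above. The function names, the tags and the sample rows are all made up for illustration. The assumption is that each mapper emits (join key, tagged value) pairs, so rows of different widths can be appended into one list without any trouble.

```python
from collections import defaultdict


def map_sales(row):
    # Hypothetical sales row: [customer_id, item, amount].
    customer_id, item, amount = row
    return (customer_id, ("sale", (item, amount)))


def map_customers(row):
    # Hypothetical customer row: [customer_id, name, city, tier].
    # The row is wider than a sales row; only the key position matters.
    return (row[0], ("customer", tuple(row[1:])))


def shuffle(pairs):
    # Group every value under its key. This grouping is what performs the join.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_join(key, values):
    # Pair each customer record with each sale that shares the key.
    customer_values = [v for tag, v in values if tag == "customer"]
    sale_values = [v for tag, v in values if tag == "sale"]
    return [(key, c, s) for c in customer_values for s in sale_values]


sales_rows = [["c1", "book", 12], ["c2", "pen", 3], ["c1", "lamp", 40]]
customer_rows = [["c1", "Ann", "Paris", "gold"], ["c2", "Bob", "Oslo", "basic"]]

# "Append": concatenate the two map outputs into one list of pairs.
appended = [map_sales(r) for r in sales_rows] + [map_customers(r) for r in customer_rows]

joined = []
for key, values in sorted(shuffle(appended).items()):
    joined.extend(reduce_join(key, values))

for record in joined:
    print(record)
```

The tag on each value is what lets the reducer tell the two sources apart after they have been appended, and no datastore fetch happens inside the map or reduce step.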
