Are you asking how many reducers are used to split a hot key? If so,
the answer is as many as we estimate it will take to make the
records for the key fit into memory. For example, if we have a key
which we estimate has 10 million records, each record being about 100
bytes, and each reduce task has 400MB of memory available, then we will
allocate 3 reducers for that hot key. We do not need to take into
account any other keys sent to those reducers, because reducers process
rows one key at a time.
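
To make the arithmetic concrete, here's a rough sketch of that
estimate in Java (the class and variable names are made up for
illustration; this is not Pig's actual code):

public class HotKeyEstimateSketch {
    public static void main(String[] args) {
        long records = 10000000L;                    // estimated records for the hot key
        long bytesPerRecord = 100L;                  // estimated record size
        long memoryPerReducer = 400L * 1024 * 1024;  // ~400MB available per reduce task

        // ~1GB for this one key, so ceil(1GB / 400MB) = 3 reducers.
        long totalBytes = records * bytesPerRecord;
        int reducersForKey = (int) Math.ceil((double) totalBytes / memoryPerReducer);
        System.out.println(reducersForKey);          // prints 3
    }
}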
Alan.
On Jun 16, 2010, at 11:51 AM, Gang Luo wrote:
Thanks for replying. It is much clearer now. One more thing to ask
about the third question: how are reducers allocated to several hot
keys? Hashing? Further, Pig doesn't divide the reducers into hot-key
reducers and non-hot-key reducers, is that right?
Thanks,
-Gang
----- Original Message ----
From: Alan Gates <ga...@yahoo-inc.com>
To: pig-dev@hadoop.apache.org
Sent: Wed, June 16, 2010 12:16:13 PM
Subject: Re: skew join in pig
On Jun 16, 2010, at 8:36 AM, Gang Luo wrote:
Hi,
there are a few things confusing me about the skew join (http://wiki.apache.org/pig/PigSkewedJoinSpec).
1. Does the sampling job sample and build a histogram on both tables,
or just one table (and in that case, which one)?
Just the left one.
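
Conceptually, all the sampler needs is a per-key size estimate for
the left input. A rough sketch of that idea (made-up names and
logic, not Pig's actual sampler code):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LeftSampleSketch {
    // Estimate full-table record counts per key from a random sample
    // of the left input's keys, scaled up by the sampling rate.
    static Map<String, Long> estimateCounts(List<String> sampledKeys, double sampleRate) {
        Map<String, Long> estimates = new HashMap<>();
        for (String key : sampledKeys) {
            estimates.merge(key, 1L, Long::sum);     // occurrences in the sample
        }
        estimates.replaceAll((k, v) -> (long) (v / sampleRate));
        return estimates;
    }
}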
2. The join job still takes the two tables as inputs, and shuffles
tuples from the partitioned table to a particular reducer (one tuple
to one reducer), and shuffles tuples from the streamed table to all
reducers associated with one partition (one tuple to multiple
reducers). Is that correct?
Keys with values small enough to fit in memory are shuffled to
reducers as normal. Keys that are too large are split between
reducers on the left side, and on the right side are replicated to
all of the reducers that have the splits (not all reducers). Does
that answer your question?
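
Conceptually the routing works something like the sketch below. The
map of hot-key assignments and the random choice among a key's
reducers are illustrative only, not Pig's actual implementation:

import java.util.Map;
import java.util.Random;

public class SkewedRoutingSketch {
    private final Map<String, int[]> hotKeyReducers;  // from the sampling job
    private final int numReducers;
    private final Random random = new Random();

    SkewedRoutingSketch(Map<String, int[]> hotKeyReducers, int numReducers) {
        this.hotKeyReducers = hotKeyReducers;
        this.numReducers = numReducers;
    }

    // Left (partitioned) side: each tuple goes to exactly one reducer.
    int routeLeft(String key) {
        int[] assigned = hotKeyReducers.get(key);
        if (assigned == null) {                       // not hot: normal shuffle
            return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
        }
        return assigned[random.nextInt(assigned.length)];  // one of the key's splits
    }

    // Right (streamed) side: a hot-key tuple is copied to every reducer
    // holding a split of that key -- not to all reducers.
    int[] routeRight(String key) {
        int[] assigned = hotKeyReducers.get(key);
        if (assigned == null) {
            return new int[] { (key.hashCode() & Integer.MAX_VALUE) % numReducers };
        }
        return assigned;
    }
}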
3. Hot keys need more than one reducer. Are these reducers
dedicated to this key only? Could they also take other keys at the
same time?
They take other keys at the same time.
4. For non-hot keys, my understanding is that they are shuffled to
reducers based on the default hash partitioner. However, it could
happen that all the keys shuffled to one reducer together cause
skew, even though none of them is skewed individually.
This is always the case in MapReduce, though a good hash function
should minimize how often this happens.
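
For reference, the default partitioner is essentially the following,
so each key always lands on exactly one reducer and an unlucky mix
of keys can still pile up on the same one:

public class DefaultPartitionSketch {
    // Essentially what Hadoop's default HashPartitioner does.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}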
Can someone give me some ideas on these? Thanks.
-Gang
Alan.