Thanks for replying. It is much clear now. One more thing to ask about the third question is, how to allocate reducers to several hot keys? Hashing? Further, Pig doesn't divide the reducers into hot-key reducers and non-hot-key reducers, is it right?
Thanks, -Gang ----- 原始邮件 ---- 发件人: Alan Gates <ga...@yahoo-inc.com> 收件人: pig-dev@hadoop.apache.org 发送日期: 2010/6/16 (周三) 12:16:13 下午 主 题: Re: skew join in pig On Jun 16, 2010, at 8:36 AM, Gang Luo wrote: > Hi, > there is something confusing me in the skew join > (http://wiki.apache.org/pig/PigSkewedJoinSpec) > 1. does the sampling job sample and build histogram on both tables, or just > one table (in this case, which one) ? Just the left one. > 2. the join job still take the two table as inputs, and shuffle tuples from > partitioned table to particular reducer (one tuple to one reducer), and > shuffle tuples from streamed table to all reducers associative to one > partition (one tuple to multiple reducers). Is that correct? Keys with small enough values to fit in memory are shuffled to reducers as normal. Keys that are too large are split between reducers on the left side, and replicated to all of those reducers that have the splits (not all reducers) on the right side. Does that answer your question? > 3. Hot keys need more than one reducers. Are these reducers dedicated to this > key only? Could they also take other keys at the same time? They take other keys at the same time. > 4. for non-hot keys, my understanding is that they are shuffled to reducers > based on default hash partitioner. However, it could happen all the keys > shuffled to one reducers incurs skew even none of them is skewed individually. This is always the case in map reduce, though a good hash function should minimize the occurrences of this. > > Can someone give me some ideas on these? Thanks. > > -Gang > > > Alan.