Hi,
There are a few things confusing me about the skewed join 
(http://wiki.apache.org/pig/PigSkewedJoinSpec):
1. Does the sampling job sample and build a histogram on both tables, or on just 
one table (and in that case, which one)?
2. The join job still takes both tables as input. It shuffles each tuple from 
the partitioned table to one particular reducer (one tuple to one reducer), and 
shuffles each tuple from the streamed table to all reducers associated with one 
partition (one tuple to multiple reducers). Is that correct?
3. Hot keys need more than one reducer. Are these reducers dedicated to that 
key only, or can they also take other keys at the same time?
4. For non-hot keys, my understanding is that they are shuffled to reducers 
by the default hash partitioner. However, the keys shuffled to one reducer 
could together cause skew even though none of them is skewed individually.
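
To make questions 2 and 4 concrete, here is a rough Python sketch of the routing 
I have in mind. It is only my mental model, not Pig's actual code: the function 
names, the hot_key_reducers map (hot key -> list of reducer indices, which I 
assume comes from the sampling job), and the CRC32 hash are all made up for 
illustration.

```python
import random
import zlib


def stable_hash(key):
    """Deterministic hash standing in for the default hash partitioner (assumption)."""
    return zlib.crc32(str(key).encode("utf-8"))


def route_partitioned(key, hot_key_reducers, num_reducers, rng):
    """Route one tuple of the partitioned table to exactly one reducer."""
    if key in hot_key_reducers:
        # Hot key: spread its tuples over the reducers assigned to that key.
        return [rng.choice(hot_key_reducers[key])]
    # Non-hot key: plain hash partitioning.
    return [stable_hash(key) % num_reducers]


def route_streamed(key, hot_key_reducers, num_reducers):
    """Route one tuple of the streamed table to every reducer that may hold its key."""
    if key in hot_key_reducers:
        # Hot key: replicate the tuple to all reducers assigned to that key.
        return list(hot_key_reducers[key])
    # Non-hot key: same single reducer as the partitioned side.
    return [stable_hash(key) % num_reducers]


if __name__ == "__main__":
    hot = {"k1": [0, 1, 2]}  # hypothetical output of the sampling job
    rng = random.Random(0)
    print(route_partitioned("k1", hot, 4, rng))  # one reducer out of 0, 1, 2
    print(route_streamed("k1", hot, 4))          # all of 0, 1, 2
    print(route_streamed("k2", hot, 4))          # a single hash-chosen reducer
```

In this sketch nothing stops a non-hot key such as "k2" from hashing to reducer 
0, 1, or 2, which is what question 3 is about, and nothing balances the total 
load of the non-hot keys, which is the concern in question 4.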

Can someone give me some ideas on these? Thanks.

-Gang


