The sampling job in Pig is used by "order by" and "skewed join". It is
translated into a single map-reduce job. In the map, we sample the data
at a configurable interval; in the reduce, we do a "group all"
followed by a nested foreach. Within the foreach, we do a nested sort and
then feed the result to a UDF ("order by" and "skewed join" use different UDFs).
PIG-1038 optimizes a nested sort by using Hadoop secondary sort where
possible. The sampling job fits the bill, so PIG-841 is fixed automatically.
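
To make the shape of the sampling job concrete, here is a rough sketch in plain Python rather than Pig's actual code. The function names (map_sample, pick_boundaries, reduce_all) are made up for illustration, and pick_boundaries is only a hypothetical stand-in for the UDF: it picks evenly spaced keys from the sorted sample, which is an assumption about what an "order by" style UDF might compute, not what Pig ships.

```python
from typing import Iterable, List


def map_sample(records: Iterable[int], interval: int) -> List[int]:
    """Map side: keep every `interval`-th record (interval is configurable)."""
    return [rec for i, rec in enumerate(records) if i % interval == 0]


def pick_boundaries(sorted_sample: List[int], num_partitions: int) -> List[int]:
    """Hypothetical stand-in for the UDF: evenly spaced keys from the sample."""
    n = len(sorted_sample)
    if n == 0:
        return []
    return [sorted_sample[(n * k) // num_partitions]
            for k in range(1, num_partitions)]


def reduce_all(samples_per_map: List[List[int]], num_partitions: int) -> List[int]:
    """Reduce side: "group all", nested sort, then hand the result to the UDF."""
    # "group all": every sampled key ends up in one group on one reducer.
    merged = [key for sample in samples_per_map for key in sample]
    # Nested sort inside the foreach. This in-memory sort is the step that
    # a secondary sort would hand over to the shuffle instead.
    merged.sort()
    return pick_boundaries(merged, num_partitions)
```

For example, reduce_all([map_sample(range(10), 3), [12]], 2) merges the samples from two maps, sorts them, and returns a single boundary key that splits the sample into two ranges.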
Daniel
On 03/05/2011 12:54 PM, Renato Marroquín Mogrovejo wrote:
Hey, does anybody know if PIG-841 was developed? And if it was, how is it
being used by Pig?
Thanks in advance.
Renato M.