The sampling job in Pig is used by "order by" and "skewed join". It is translated into a single map-reduce job. In the map, we sample the data at a configurable interval; in the reduce, we do a "group all" followed by a nested foreach. Within the foreach, we do a nested sort and then feed the result to a UDF ("order by" and "skewed join" use different UDFs).
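
To make the flow above concrete, here is a rough sketch in plain Python. It is not Pig's actual code: the function names, the "keep every N-th record" sampling rule, and the quantile-picking stand-in for the "order by" UDF are all assumptions made for illustration only.

```python
from typing import Iterable, List


def map_sample(records: Iterable[int], interval: int) -> List[int]:
    """Map side: keep every `interval`-th record (assumed sampling rule)."""
    return [rec for i, rec in enumerate(records) if i % interval == 0]


def find_quantiles(sorted_sample: List[int], num_partitions: int) -> List[int]:
    """Hypothetical stand-in for the "order by" UDF: pick partition boundaries."""
    n = len(sorted_sample)
    if n == 0:
        return []
    return [sorted_sample[(i * n) // num_partitions]
            for i in range(1, num_partitions)]


def reduce_sample(map_outputs: List[List[int]], num_partitions: int) -> List[int]:
    """Reduce side: "group all", nested sort, then hand the bag to the UDF."""
    bag = [rec for out in map_outputs for rec in out]  # "group all": one bag
    bag.sort()                                         # nested sort
    return find_quantiles(bag, num_partitions)         # UDF call


# Two input splits, each sampled independently by its own map task.
splits = [list(range(0, 100)), list(range(100, 200))]
samples = [map_sample(split, interval=10) for split in splits]
boundaries = reduce_sample(samples, num_partitions=4)
print(boundaries)
```

The sort inside reduce_sample is the step that a secondary sort could take over, since the shuffle would then deliver the sampled records to the reducer already ordered.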

In PIG-1038, we will optimize the nested sort using Hadoop secondary sort where possible. The sampling job fits the bill, so PIG-841 is fixed automatically.

Daniel

On 03/05/2011 12:54 PM, Renato Marroquín Mogrovejo wrote:
Hey, does anybody know if PIG-841 was developed? And if it was, how is it
being used by Pig?
Thanks in advance.

Renato M.
