Re: UDF for generating top xx % of results?

Thejas Nair Wed, 30 Jun 2010 14:45:08 -0700

On 6/30/10 9:02 AM, "hc busy" <[email protected]> wrote:

> @Thejas  I had thought that Limit is distributed and does not guarantee u
> get the results in order ??
> 

As mentioned in the section on LIMIT here -
http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref2.html#LIMIT
" There is no guarantee which tuples will be returned, and the tuples that
are returned can change from one run to the next. A particular set of tuples
can be requested using the ORDER operator followed by LIMIT. "

The query -
Set default_parallel 10;
L = load 'x';
O = order L by $0;
LIM = limit O 100;

will result in 3 MR jobs (see the explain output for details):
1st MR - the sampling job for order-by. It estimates the distribution of the
sort key and decides how to partition the data for ordering.
2nd MR - orders the data. Each reduce task outputs only its first 100
records.
3rd MR - does the final limit. The map reads the records with the sort key as
the key, and a single reduce task outputs the first 100 records.
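
To make the three stages concrete, here is a rough simulation in plain Python.
It is only a sketch of the idea, not Pig's actual implementation: records are
reduced to bare sort keys, and the function names, the sample size and the way
split points are chosen are all my own assumptions. It shows why keeping only
the first 100 records per reducer is safe: the overall first 100 are always
among them.

```python
import random
from bisect import bisect_right


def sample_boundaries(keys, num_reducers, sample_size=1000, seed=0):
    """Job 1 (sketch): sample the sort keys and pick num_reducers - 1 split points."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []  # no data: everything goes to a single partition
    step = len(sample) / num_reducers
    return [sample[int(i * step)] for i in range(1, num_reducers)]


def order_with_local_limit(keys, boundaries, limit):
    """Job 2 (sketch): range-partition, sort each partition, keep its first `limit` keys."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        partitions[bisect_right(boundaries, key)].append(key)
    return [sorted(part)[:limit] for part in partitions]


def final_limit(reducer_outputs, limit):
    """Job 3 (sketch): a single reducer sees all surviving keys in order, keeps `limit`."""
    merged = sorted(key for out in reducer_outputs for key in out)
    return merged[:limit]


def order_then_limit(keys, limit=100, num_reducers=10):
    """Hypothetical driver that chains the three simulated jobs."""
    boundaries = sample_boundaries(keys, num_reducers)
    per_reducer = order_with_local_limit(keys, boundaries, limit)
    return final_limit(per_reducer, limit)


if __name__ == "__main__":
    data = [random.randint(0, 10**6) for _ in range(50000)]
    # The simulated pipeline agrees with a plain global sort followed by a cut.
    print(order_then_limit(data) == sorted(data)[:100])
```

With 10 reducers and a limit of 100, the second stage passes on at most 1000
keys, so the single reducer in the last stage has very little work to do.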
