Command that selects a random sample of the rows, similar to LIMIT
------------------------------------------------------------------

                 Key: PIG-795
                 URL: https://issues.apache.org/jira/browse/PIG-795
             Project: Pig
          Issue Type: New Feature
          Components: impl
            Reporter: Eric Gaudet
            Priority: Trivial


When working with very large data sets (imagine that!), running a Pig script 
can take time. It may be useful to run on a small subset of the data in some 
situations (e.g. debugging / testing, or to get fast results even if they are 
less accurate).

The command "LIMIT N" selects the first N rows of the data, but these are not 
necessarily randomized. A command "SAMPLE X" would retain each row only with 
probability X%.
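
To make the difference concrete, here is a minimal sketch of the two 
behaviours in plain Python. It is not Pig code and not a proposed 
implementation; the function names limit_rows and sample_rows are made up for 
illustration, and it assumes each row is kept or dropped independently of the 
others.

```python
import itertools
import random


def limit_rows(rows, n):
    """Keep the first n rows, in the spirit of LIMIT N."""
    return itertools.islice(rows, n)


def sample_rows(rows, percent, rng=None):
    """Keep each row independently with probability percent/100.

    Hypothetical stand-in for the proposed SAMPLE X; rng is an optional
    random.Random instance so that a run can be made repeatable.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    rng = rng or random.Random()
    p = percent / 100.0
    for row in rows:
        # random() is uniform on [0, 1), so this test succeeds with probability p
        if rng.random() < p:
            yield row


if __name__ == "__main__":
    rows = range(1_000_000)
    print(list(limit_rows(rows, 5)))                   # always the same leading rows
    print(len(list(sample_rows(rows, 1))))             # close to 10000, varies per run
```

Note that with this kind of per-row sampling the number of rows returned is 
itself random: it is only X% of the input on average, whereas LIMIT N returns 
exactly N rows whenever at least N are available.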

Note: it is possible to implement this feature with FILTER BY and a UDF, but 
so is LIMIT, and LIMIT is built-in.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
