Command that selects a random sample of the rows, similar to LIMIT
------------------------------------------------------------------
Key: PIG-795
URL: https://issues.apache.org/jira/browse/PIG-795
Project: Pig
Issue Type: New Feature
Components: impl
Reporter: Eric Gaudet
Priority: Trivial
When working with very large data sets (imagine that!), running a Pig script
can take a long time. In some situations it would be useful to run on a small
subset of the data, e.g. for debugging or testing, or to get fast results at
the cost of some accuracy.
The command "LIMIT N" selects the first N rows of the data, but these are not
necessarily a random sample. A command "SAMPLE X" would instead retain each
row with probability X%.
Note: it is possible to implement this feature with FILTER BY and a UDF, but
the same is true of LIMIT, and LIMIT is built in.
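
To make the difference between the two behaviours concrete, here is a minimal
sketch in Python rather than Pig. It is not Pig's implementation: the function
names limit and sample, and the optional rng parameter, are hypothetical and
exist only for illustration. It assumes that "SAMPLE X" means each row is kept
independently with probability X%, as described above.

```python
import random
from itertools import islice


def limit(rows, n):
    """Keep the first n rows, as LIMIT N does.

    Cheap, but the result reflects whatever order the input arrives in.
    """
    return list(islice(rows, n))


def sample(rows, percent, rng=None):
    """Keep each row independently with probability percent/100.

    Assumed semantics for the proposed SAMPLE X. The output size is
    random: it is only about percent% of the input, not exactly that.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if rng is None:
        rng = random.Random()
    p = percent / 100.0
    # rng.random() is uniform on [0.0, 1.0), so p == 1.0 keeps every row
    # and p == 0.0 keeps none.
    return [row for row in rows if rng.random() < p]


if __name__ == "__main__":
    rows = range(100000)
    print(limit(rows, 5))         # always the first five rows
    print(len(sample(rows, 1)))   # roughly 1000, varies from run to run
```

The list comprehension in sample is the same per-row test that a FILTER BY with
a random-number UDF would apply, which is why the feature can be expressed that
way. Unlike LIMIT, it has to read every row, and it does not guarantee an exact
output size.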