[ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eric Gaudet updated PIG-795:
----------------------------

    Attachment: sample3.diff

This patch implements the SAMPLE operator as a rewrite to FILTER performed by the query parser, as suggested by Olga and Alan. It uses a new built-in function, RANDOM(), copied from piggybank. The patch also adds the unit test TestSample.

I am unfamiliar with crafting a LogicalPlan, so the code might not be the best. Please feel free to clean it up.

> Command that selects a random sample of the rows, similar to LIMIT
> ------------------------------------------------------------------
>
>                 Key: PIG-795
>                 URL: https://issues.apache.org/jira/browse/PIG-795
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>            Reporter: Eric Gaudet
>            Priority: Trivial
>         Attachments: sample2.diff, sample3.diff
>
>
> When working with very large data sets (imagine that!), running a pig script
> can take time. It may be useful to run on a small subset of the data in some
> situations (e.g. debugging/testing, or to get fast results even if less
> accurate).
> The command "LIMIT N" selects the first N rows of the data, but these are not
> necessarily randomized. A command "SAMPLE X" would retain each row only with
> probability X%.
> Note: it is possible to implement this feature with FILTER BY and a UDF, but
> so is LIMIT, and LIMIT is built-in.
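
For readers who want the intended semantics spelled out: rewriting SAMPLE as a FILTER on RANDOM() amounts to keeping each row independently whenever a uniform random draw falls below the sampling fraction. The following is a minimal Python sketch of that idea, not the code in sample3.diff. The name `sample_rows` and its parameters are hypothetical, and the fraction is assumed to be given in [0, 1] rather than as the percentage mentioned in the issue description.

```python
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def sample_rows(
    rows: Iterable[T],
    fraction: float,
    rng: Optional[random.Random] = None,
) -> Iterator[T]:
    """Yield each row independently with probability `fraction`.

    Hypothetical helper illustrating a SAMPLE-as-FILTER rewrite; the
    exact expression generated by the patch is not shown in this message.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if rng is None:
        rng = random.Random()
    for row in rows:
        # Same idea as a filter predicate of the form RANDOM() < fraction:
        # random() is uniform on [0, 1), so the row survives with
        # probability `fraction`.
        if rng.random() < fraction:
            yield row
```

Unlike LIMIT, the number of rows returned is not fixed: it only averages `fraction` times the input size, and every row is still read once. Passing a seeded `random.Random` makes a run repeatable, which may help when a test such as TestSample needs stable expectations.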