[ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eric Gaudet updated PIG-795:
----------------------------

    Attachment: sample2.diff

This patch implements the SAMPLE command. It adds a random sample mode to the LIMIT class. The syntax is like LIMIT: "a = SAMPLE x", where x is an integer and 0<=x<=100. Each row is selected if rand()<(x/100).

Example:
a = LOAD 'mybigdata'
b = SAMPLE 5
... will select approximately 5% of the rows.

> Command that selects a random sample of the rows, similar to LIMIT
> ------------------------------------------------------------------
>
>                 Key: PIG-795
>                 URL: https://issues.apache.org/jira/browse/PIG-795
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>            Reporter: Eric Gaudet
>            Priority: Trivial
>         Attachments: sample2.diff
>
>
> When working with very large data sets (imagine that!), running a pig script
> can take time. It may be useful to run on a small subset of the data in some
> situations (e.g. debugging / testing, or to get fast results even if less
> accurate).
> The command "LIMIT N" selects the first N rows of the data, but these are not
> necessarily randomized. A command "SAMPLE X" would retain each row with
> probability X%.
> Note: it is possible to implement this feature with FILTER BY and a UDF, but
> so is LIMIT, and LIMIT is built-in.
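
For readers unfamiliar with this kind of sampling, the selection rule described in the patch comment (keep a row when rand() < x/100) can be sketched outside Pig as follows. This is only an illustration in Python, not the code in sample2.diff; the function name sample_rows and its rng parameter are hypothetical and do not come from the patch.

```python
import random


def sample_rows(rows, percent, rng=None):
    """Yield each row independently with probability percent/100.

    Hypothetical helper illustrating the rule rand() < (x/100);
    it is not taken from the Pig patch.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    rng = rng or random.Random()
    threshold = percent / 100.0
    for row in rows:
        # rng.random() returns a float in [0.0, 1.0), so percent=0 keeps
        # no rows and percent=100 keeps every row.
        if rng.random() < threshold:
            yield row


if __name__ == "__main__":
    data = range(100000)
    kept = list(sample_rows(data, 5))
    print(len(kept), "of", len(data), "rows kept")
```

Because every row is tested independently, the number of rows kept varies from run to run and is only close to x% of the input on average. This differs from LIMIT N, which returns a fixed number of rows.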