[ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705072#action_12705072 ]
Eric Gaudet commented on PIG-795:
---------------------------------

Thanks for your feedback. (BTW, should these issues be discussed in a different place?)

Here are my comments:

1) I agree that the 1% minimum looks arbitrary and annoying, but I decided to keep it for several reasons. Most importantly, I didn't want to disturb the syntax of LIMIT, which expects an integer. Secondly, 1% is a reasonable minimum if you want a statistically significant result. And finally, you can work around the limitation by adding a second level of sampling (or more):

b = SAMPLE a 1;
c = SAMPLE b 1;

gives you 0.01%.

Now that I think about it, it's easy to change the syntax and use a float for SAMPLE. The value would be a probability between 0.0 and 1.0. It's cleaner this way, and I will send a new patch for that.

2) I implemented it in LIMIT because both are specialized filters in a way, with a similar syntax. This way the code changes are very small. It already exists as a filter without any coding needed:

b = FILTER a BY org.apache.pig.piggybank.evaluation.math.RANDOM()<0.01;

The syntax is not very user-friendly, though.

3) I will add unit tests in the new patch with floats.

I will produce a new patch with the float syntax and unit tests in the next few days, unless you tell me you prefer FILTER BY.

> Command that selects a random sample of the rows, similar to LIMIT
> ------------------------------------------------------------------
>
>                 Key: PIG-795
>                 URL: https://issues.apache.org/jira/browse/PIG-795
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>            Reporter: Eric Gaudet
>            Priority: Trivial
>         Attachments: sample2.diff
>
>
> When working with very large data sets (imagine that!), running a pig script
> can take time. It may be useful to run on a small subset of the data in some
> situations (e.g. debugging / testing, or to get fast results even if less
> accurate).
> The command "LIMIT N" selects the first N rows of the data, but these are not
> necessarily randomized. A command "SAMPLE X" would retain each row only with
> probability X%.
> Note: it is possible to implement this feature with FILTER BY and a UDF, but
> so is LIMIT, and LIMIT is built-in.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
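
For readers who want to check the arithmetic in point 1 of the comment, here is a minimal Python simulation of the proposed semantics. It is only a sketch, not the Pig implementation or the contents of sample2.diff: the function names sample and chained_retention are hypothetical, and the code assumes each row is kept independently with the given probability, as the float syntax in the comment describes.

```python
import random


def sample(rows, probability, rng):
    """Keep each row independently with the given probability (0.0 to 1.0).

    Hypothetical helper mirroring the proposed float form of SAMPLE;
    it behaves like FILTER ... BY RANDOM() < probability.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0.0 and 1.0")
    return [row for row in rows if rng.random() < probability]


def chained_retention(*probabilities):
    """Expected fraction of rows that survive a chain of independent samples."""
    fraction = 1.0
    for p in probabilities:
        fraction *= p
    return fraction


if __name__ == "__main__":
    rng = random.Random(42)          # fixed seed so runs are repeatable
    a = range(1_000_000)
    b = sample(a, 0.01, rng)         # analogous to: b = SAMPLE a 1;  (1%)
    c = sample(b, 0.01, rng)         # analogous to: c = SAMPLE b 1;  (1% of 1%)
    print("rows after first sample: ", len(b))
    print("rows after second sample:", len(c))
    print("expected fraction kept:  ", chained_retention(0.01, 0.01))
```

Because the two passes are independent, the retention probabilities multiply, which is why two 1% samples keep about 0.01% of the rows. The row counts printed by the simulation vary around that expectation, as they would for any random sample.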