[ 
https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705037#action_12705037
 ] 

Alan Gates commented on PIG-795:
--------------------------------

Eric,

Thanks for the patch.  I agree this is a feature that people will find useful.  
I have a few questions and comments:

1) Is 1% the minimum sample size people will want to work with?  Given that 
data in the grid can be on the order of terabytes, I can see people wanting a 
0.1% sample, or even 0.01% sample.  Maybe that's too hard to specify nicely in 
the syntax, or maybe people will be happy with 1% minimum.  I'm not sure, but 
it's worth thinking about.

2) Sample and limit aren't really related, so implementing this in limit seems 
artificial.  Could it instead be implemented as a filter with a random 
function?  So the grammar production would look like:

X = SAMPLE Y a% => X = FILTER Y BY RANDOM() < a;

with RANDOM being a function you would add that returns a random number 
between 0 and 100, so that a row is kept a% of the time.

The advantage of this is that we hope, in the future, to push filter operators 
down into the load functions themselves.  Intelligent load functions could then 
take this filter and not even deserialize a record until they had decided 
whether it was going to be kept or not.

3) The patch should include unit tests.
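
To make the idea in point 2 concrete, here is a minimal sketch in Python (not 
Pig code, and not taken from the patch) of sampling expressed as a filter over 
a random draw.  The name sample_filter and its parameters are hypothetical.  
It takes the percentage as a float, so the 0.1% and 0.01% samples from point 1 
need no special handling:

```python
import random


def sample_filter(rows, percent, rng=None):
    """Keep each row independently with probability percent/100.

    Hypothetical helper for illustration only; it mirrors the idea
    "SAMPLE Y a%" == "FILTER Y BY <random draw> < a".
    """
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    rng = rng or random.Random()
    threshold = percent / 100.0
    # rng.random() is uniform on [0.0, 1.0), so the comparison below is
    # true with probability `threshold` for every row, independently.
    return (row for row in rows if rng.random() < threshold)


# Usage: a 0.1% sample of one million rows, seeded for repeatability.
kept = list(sample_filter(range(1_000_000), 0.1, random.Random(42)))
```

Note that, like the proposed SAMPLE, this yields approximately a% of the rows 
rather than an exact count, since each row is decided independently.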

> Command that selects a random sample of the rows, similar to LIMIT
> ------------------------------------------------------------------
>
>                 Key: PIG-795
>                 URL: https://issues.apache.org/jira/browse/PIG-795
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>            Reporter: Eric Gaudet
>            Priority: Trivial
>         Attachments: sample2.diff
>
>
> When working with very large data sets (imagine that!), running a pig script 
> can take time. It may be useful to run on a small subset of the data in some 
> situations (e.g., debugging / testing, or to get fast results even if less 
> accurate). 
> The command "LIMIT N" selects the first N rows of the data, but these are not 
> necessarily randomized. A command "SAMPLE X" would retain each row only with 
> probability X%.
> Note: it is possible to implement this feature with FILTER BY and a UDF, but 
> so is LIMIT, and LIMIT is built-in.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
