[ 
https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705134#action_12705134
 ] 

Alan Gates commented on PIG-795:
--------------------------------

I think it's fine to have sample as a keyword.  It's valuable not just because 
it's easier syntax, but because in the future it could be expanded to more 
sophisticated sampling techniques beyond just taking a percentage of the data.  
For example:

B = SAMPLE A 1 USING 'mywhizbangnewsamplingalgorithm';

What I meant was that your patch could translate SAMPLE into a FILTER 
underneath.  Then, instead of making changes in the limit code, all you need to 
do is move RANDOM from piggybank into Pig's builtins, and change 
QueryParser.jjt to do the translation from SAMPLE to FILTER.
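
To illustrate the semantics of that translation, here is a minimal sketch in 
Python rather than Pig Latin. It assumes SAMPLE takes a fraction between 0.0 
and 1.0 and that RANDOM returns a uniform value in [0, 1); the function name 
sample_as_filter and its parameters are hypothetical and are not part of the 
patch or of Pig.

```python
import random


def sample_as_filter(rows, fraction, rng=None):
    """Keep each row independently with probability `fraction`.

    Hypothetical stand-in for rewriting SAMPLE as a FILTER whose
    predicate compares a uniform random draw against the fraction.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    if rng is None:
        rng = random.Random()
    # rng.random() is uniform in [0.0, 1.0), so a fraction of 1.0 keeps
    # every row and a fraction of 0.0 keeps none.
    return [row for row in rows if rng.random() < fraction]


if __name__ == "__main__":
    data = list(range(1000))
    kept = sample_as_filter(data, 0.1, random.Random(42))
    print(len(kept), "of", len(data), "rows kept")
```

Because each row is tested independently, the number of rows kept varies 
from run to run and is only approximately fraction times the input size, 
unlike LIMIT, which returns an exact count.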

> Command that selects a random sample of the rows, similar to LIMIT
> ------------------------------------------------------------------
>
>                 Key: PIG-795
>                 URL: https://issues.apache.org/jira/browse/PIG-795
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>            Reporter: Eric Gaudet
>            Priority: Trivial
>         Attachments: sample2.diff
>
>
> When working with very large data sets (imagine that!), running a Pig script 
> can take time. It may be useful to run on a small subset of the data in some 
> situations (e.g. debugging or testing, or to get fast results even if they 
> are less accurate). 
> The command "LIMIT N" selects the first N rows of the data, but these are not 
> necessarily randomized. A command "SAMPLE X" would retain each row only with 
> probability X%.
> Note: it is possible to implement this feature with FILTER BY and a UDF, but 
> the same is true of LIMIT, and LIMIT is built in.
