[
https://issues.apache.org/jira/browse/PIG-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Thejas M Nair updated PIG-1926:
-------------------------------
Release Note:
Limit and Sample now accept a variable (scalar) as argument.
For example, the new Limit command allows the following syntax to get the top
1% of a sorted file:
[ a = LOAD 'a.txt'; b = GROUP a ALL; c = FOREACH b GENERATE COUNT(a) AS sum;
d = ORDER a BY $0; e = LIMIT d c.sum/100; ]
Only scalar variables may be used in the expression given to Limit or Sample;
columns of the operation's own input relation cannot be used in it. A
statement like [ e = LIMIT d $0; ] is invalid.
The new Sample command allows for the same syntax.
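A minimal sketch of the Sample form, reusing the input file from the Limit
example. The alias names and the target of roughly 1000 rows are illustrative
assumptions, not part of the patch. Sample expects a fraction between 0 and 1,
so the expression divides a target row count by the total, and the cast to
double avoids integer division. It assumes the input has at least 1000 rows,
so the fraction stays at or below 1.

```pig
a = LOAD 'a.txt';
b = GROUP a ALL;
-- c holds a single row, so c.num_rows can be used as a scalar
c = FOREACH b GENERATE COUNT_STAR(a) AS num_rows;
-- keep each row with probability 1000/num_rows (about 1000 rows overall)
e = SAMPLE a (double)1000/c.num_rows;
```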
Using a variable instead of a constant in Limit automatically disables most of
the optimizations (only push-before-foreach is performed). More work is needed
to enable optimizations for limit-after-sort, limit duplication before
cross/union and limit merging.
was:
Limit and Sample now accept a variable (scalar) as argument. Using a variable
instead of a constant in Limit automatically disables most of the optimizations
(only push-before-foreach is performed). More work is needed to enable
optimizations for limit-after-sort, limit duplication before cross/union and
limit merging.
The new Limit command allows the following syntax to get the top 1% of a sorted
file:
[ a = LOAD 'a.txt'; b = GROUP a ALL; c = FOREACH b GENERATE COUNT(a) AS sum;
d = ORDER a BY $0; e = LIMIT d c.sum/100; ]
Only aggregate variables may be used as argument for limit. A statement like [
e = LIMIT d $0; ] is invalid.
The new Sample command allows for the same syntax.
> Sample/Limit should take scalar
> -------------------------------
>
> Key: PIG-1926
> URL: https://issues.apache.org/jira/browse/PIG-1926
> Project: Pig
> Issue Type: Improvement
> Reporter: Daniel Dai
> Assignee: Gianmarco De Francisci Morales
> Labels: gsoc2011
> Fix For: 0.10
>
> Attachments: PIG-1926.10.patch, PIG-1926.11.patch,
> PIG-1926.12.1.patch, PIG-1926.12.patch, PIG-1926.7.patch, PIG-1926.8.patch,
> PIG-1926.9.patch, PIG-1926.patch, PIG-1926.patch, PIG-1926.patch,
> PIG-1926.patch, PIG-1926.patch, PIG-1926.patch
>
>
> Currently, Limit and Sample only take a constant. It would be better if we
> could use a scalar in place of the constant. E.g.:
> {code}
> a = load 'a.txt';
> b = group a all;
> c = foreach b generate COUNT(a) as sum;
> d = order a by $0;
> e = limit d c.sum/100;
> {code}
> This is a candidate project for Google Summer of Code 2011. More information
> about the program can be found at http://wiki.apache.org/pig/GSoc2011
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira