[
https://issues.apache.org/jira/browse/PIG-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611964#action_12611964
]
Ajay Garg commented on PIG-296:
-------------------------------
h2. Specification of Cumulative Sum, Row, Rank, and Dense Rank
h2. {color:red} Cumulative Sum{color}
Useful for calculating cumulative distributions.
x[] is an ordered set of values
cumulative sum[1] = x[1]
cumulative sum[i] = cumulative sum[i-1] + x[i], for i > 1
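As an illustration of the running total described above, here is a minimal Python sketch. It is not the code from cumulative.patch; the function name {{cumulative_sum}} is hypothetical, and the sample frequencies are taken from the tables in this comment.

```python
from itertools import accumulate


def cumulative_sum(values):
    """Return the running totals of an ordered sequence.

    The first entry equals values[0]; every later entry adds the
    current value to the previous running total.
    (Hypothetical helper for illustration, not the Pig UDF itself.)
    """
    return list(accumulate(values))


# Frequencies from the sample tables, already ordered descending.
freqs = [20000, 15000, 10000, 5000, 5000, 4000]
print(cumulative_sum(freqs))  # [20000, 35000, 45000, 50000, 55000, 59000]
```

Dividing each running total by the final total (59,000 here) would give the cumulative distribution.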
---
h2. {color:red}Row{color}
Label of the nth item in an ordered set.
x[] is an ordered set of values
for (i = 1; i <= length(x); i++) {
    row[i] = i
}
||query||freq||row||
|myspace|20,000|1|
|facebook|15,000|2|
|answers|10,000|3|
|yahoo|5,000|4|
|irs|5,000|5|
|news|4,000|6|
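The row label is simply the 1-based position in the ordered set, so ties still receive distinct values. A minimal Python sketch, assuming the input is already ordered (the name {{row_numbers}} is hypothetical and not part of the patch):

```python
def row_numbers(values):
    """Return the 1-based position of every item in an ordered sequence.

    Duplicate values still get distinct row numbers.
    (Hypothetical helper for illustration, not the Pig UDF itself.)
    """
    return [position for position, _ in enumerate(values, start=1)]


# Frequencies from the sample table, already ordered descending.
freqs = [20000, 15000, 10000, 5000, 5000, 4000]
print(row_numbers(freqs))  # [1, 2, 3, 4, 5, 6]
```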
---
h2. {color:red}Rank{color}
Useful for calculating Zipf distributions. Duplicate values of x result in the
same rank value. Gaps in the sequence values for Rank occur following a run of
duplicate values of x.
x[] is an ordered set of values
for (i = 1; i <= length(x); i++) {
    if (i == 1) {
        rank[1] = 1
    } else if (x[i] == x[i-1]) {
        rank[i] = rank[i-1]
    } else {
        rank[i] = i
    }
}
||query||freq||rank||
|myspace|20,000|1|
|facebook|15,000|2|
|answers|10,000|3|
|yahoo|5,000|4|
|irs|5,000|4|
|news|4,000|6|
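The rank rule above can be sketched in Python as follows. This is an illustration under the assumption that the input is already ordered; the name {{rank}} is hypothetical and the code is not taken from cumulative.patch.

```python
def rank(values):
    """Return ranks with gaps for an ordered sequence.

    A value equal to its predecessor reuses the predecessor's rank;
    otherwise the rank is the 1-based position, which leaves a gap
    after a run of duplicates.
    (Hypothetical helper for illustration, not the Pig UDF itself.)
    """
    ranks = []
    for index, value in enumerate(values):
        if index > 0 and value == values[index - 1]:
            ranks.append(ranks[-1])   # tie: same rank as the previous item
        else:
            ranks.append(index + 1)   # new value: rank is the position
    return ranks


# Frequencies from the sample table, already ordered descending.
freqs = [20000, 15000, 10000, 5000, 5000, 4000]
print(rank(freqs))  # [1, 2, 3, 4, 4, 6]
```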
---
h2. {color:red}Dense Rank{color}
Useful for calculating top-N or bottom-N. Unlike Rank, there are no gaps in the
sequence values for Dense Rank.
x[] is an ordered set of values
for (i = 1; i <= length(x); i++) {
    if (i == 1) {
        dense_rank[1] = 1
    } else if (x[i] == x[i-1]) {
        dense_rank[i] = dense_rank[i-1]
    } else {
        dense_rank[i] = dense_rank[i-1] + 1
    }
}
Note: x[i] and x[i-1] can be tracked as the current and previous values, so an
indexed array is not required.
||query||freq||dense rank||
|myspace|20,000|1|
|facebook|15,000|2|
|answers|10,000|3|
|yahoo|5,000|4|
|irs|5,000|4|
|news|4,000|5|
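A minimal Python sketch of dense rank that keeps only the previous value and the current rank rather than indexing into an array, so it works on any ordered iterable. The name {{dense_rank}} is hypothetical and the code is an illustration, not the implementation in cumulative.patch.

```python
def dense_rank(values):
    """Return gap-free ranks for an ordered iterable.

    Only the previous value and the current rank are kept, so the
    input can be a stream rather than an indexed array.
    (Hypothetical helper for illustration, not the Pig UDF itself.)
    """
    ranks = []
    previous = None
    current_rank = 0
    for position, value in enumerate(values):
        # The first item, or any change in value, moves to the next rank.
        if position == 0 or value != previous:
            current_rank += 1
        ranks.append(current_rank)
        previous = value
    return ranks


# Frequencies from the sample table, already ordered descending.
freqs = [20000, 15000, 10000, 5000, 5000, 4000]
print(dense_rank(freqs))  # [1, 2, 3, 4, 4, 5]
```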
> UDF for cumulative statistics
> -----------------------------
>
> Key: PIG-296
> URL: https://issues.apache.org/jira/browse/PIG-296
> Project: Pig
> Issue Type: Improvement
> Reporter: Ajay Garg
> Priority: Minor
> Attachments: cumulative.patch
>
>
> UDF for computing cumulative sum, row, rank, and dense rank. Visit
> http://twiki.corp.yahoo.com/view/YResearch/PigStatisticsCumulative for
> a detailed description.
> To use
> A = load 'data' using PigStorage as ( query, freq );
> B = group A all;
> C = foreach B {
>     Ordered = order A by freq using numeric.OrderDescending;
>     -- Pig starts with the 0th column, so 1 refers to the column freq by offset
>     generate
>         statistics.CUMULATIVE_COLUMN(Ordered, 1) as
>         ( query, freq, freq_cumulative_sum, freq_row, freq_rank,
>           freq_dense_rank );
> };
> D = foreach C generate FLATTEN(A);