[ 
https://issues.apache.org/jira/browse/PIG-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611964#action_12611964
 ] 

Ajay Garg commented on PIG-296:
-------------------------------

h2. Specification of Cumulative Sum, Row, Rank, Dense Rank



h2. {color:red}Cumulative Sum{color}

Useful for calculating cumulative distributions.

x[] is an ordered set of values

cumulative sum[1] = x[1]
cumulative sum[i] = cumulative sum[i-1] + x[i]    (for i > 1)
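
As an illustration only, here is a minimal Python sketch of the running total, in which each entry adds x[i] to the total of the preceding entries. The function name {{cumulative_sum}} is hypothetical and is not taken from the attached patch; the sample values are the freq column used in the tables of this comment.

```python
from itertools import accumulate


def cumulative_sum(values):
    """Return the running totals of an already-ordered sequence.

    Hypothetical helper for illustration; not the patch's implementation.
    """
    # accumulate() yields x[1], x[1]+x[2], x[1]+x[2]+x[3], ...
    return list(accumulate(values))


freqs = [20000, 15000, 10000, 5000, 5000, 4000]
print(cumulative_sum(freqs))
# -> [20000, 35000, 45000, 50000, 55000, 59000]
```

Dividing each running total by the final one (59000 here) would give the cumulative distribution mentioned above.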

---
h2. {color:red}Row{color}

Label of the nth item in an ordered set.

x[] is an ordered set of values

i = 1;

row[i] = i

i++;


||query||freq||row||
|myspace|20,000|1|
|facebook|15,000|2|
|answers|10,000|3|
|yahoo|5,000|4|
|irs|5,000|5|
|news|4,000|6|
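
The row label can be sketched in Python as a simple 1-based position counter. This is an assumption-laden illustration: {{row_numbers}} is a hypothetical name, and the input is assumed to be already ordered.

```python
def row_numbers(values):
    """Return the 1-based position of each item in an ordered sequence.

    Hypothetical helper for illustration; ties are NOT collapsed.
    """
    # enumerate(..., start=1) mirrors "row[i] = i" with i starting at 1.
    return [i for i, _ in enumerate(values, start=1)]


freqs = [20000, 15000, 10000, 5000, 5000, 4000]
print(row_numbers(freqs))
# -> [1, 2, 3, 4, 5, 6]
```

Note that the two 5,000 entries receive different labels (4 and 5), which is what distinguishes Row from Rank.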

---

h2. {color:red}Rank{color}

Useful for calculating Zipf distributions. Duplicate values of x result in the 
same rank value. Gaps in the sequence values for Rank occur following a run of 
duplicate values of x.

x[] is an ordered set of values

i = 1;

if (i == 1) {
    rank[1] = 1
} else if (x[i] == x[i-1]) {
    rank[i] = rank[i-1]
} else {
    rank[i] = i
}

i++

||query||freq||rank||
|myspace|20,000|1|
|facebook|15,000|2|
|answers|10,000|3|
|yahoo|5,000|4|
|irs|5,000|4|
|news|4,000|6|
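
The rank rule above can be sketched in Python as follows. This is only an illustration under the assumption that the input is already ordered; the function name {{rank}} is hypothetical and does not come from the patch.

```python
def rank(values):
    """Return ranks for an ordered sequence; ties share a rank and leave gaps.

    Hypothetical helper for illustration; not the patch's implementation.
    """
    ranks = []
    previous = None
    for i, value in enumerate(values, start=1):
        if i > 1 and value == previous:
            # Duplicate of the previous value: reuse its rank.
            ranks.append(ranks[-1])
        else:
            # First item or a new value: rank equals the row position,
            # which produces the gap after a run of duplicates.
            ranks.append(i)
        previous = value
    return ranks


freqs = [20000, 15000, 10000, 5000, 5000, 4000]
print(rank(freqs))
# -> [1, 2, 3, 4, 4, 6]
```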

---

h2. {color:red}Dense Rank{color}

Useful for calculating top-N or bottom-N. Unlike Rank, there are no gaps in the 
sequence values for Dense Rank.

x[] is an ordered set of values

if (i == 1) {
    dense_rank[1] = 1
} else if (x[i] == x[i-1]) {
    dense_rank[i] = dense_rank[i-1]
} else {
    dense_rank[i] = dense_rank[i-1] + 1
}

The [i] and [i-1] references can be implemented by tracking the current and 
previous values; an indexed array is not required.

||query||freq||dense rank||
|myspace|20,000|1|
|facebook|15,000|2|
|answers|10,000|3|
|yahoo|5,000|4|
|irs|5,000|4|
|news|4,000|5|
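
A minimal Python sketch of the dense rank rule, keeping only the previous value and the previous rank rather than indexing into an array. The name {{dense_rank}} is hypothetical, and the input is assumed to be already ordered.

```python
def dense_rank(values):
    """Return dense ranks for an ordered sequence; ties share a rank, no gaps.

    Hypothetical helper for illustration; not the patch's implementation.
    """
    ranks = []
    previous_value = None
    previous_rank = 0
    for position, value in enumerate(values):
        if position > 0 and value == previous_value:
            # Duplicate of the previous value: reuse its rank.
            current_rank = previous_rank
        else:
            # First item or a new value: advance by exactly one.
            current_rank = previous_rank + 1
        ranks.append(current_rank)
        previous_value, previous_rank = value, current_rank
    return ranks


freqs = [20000, 15000, 10000, 5000, 5000, 4000]
print(dense_rank(freqs))
# -> [1, 2, 3, 4, 4, 5]
```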

> UDF for cumulative statistics
> -----------------------------
>
>                 Key: PIG-296
>                 URL: https://issues.apache.org/jira/browse/PIG-296
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Ajay Garg
>            Priority: Minor
>         Attachments: cumulative.patch
>
>
> UDF for computing cumulative sum, row, rank, and dense rank. Visit 
> http://twiki.corp.yahoo.com/view/YResearch/PigStatisticsCumulative for a 
> detailed description. 
> To use:
> A = load 'data' using PigStorage as ( query, freq );
> B = group A all;
> C = foreach B {
>     Ordered = order A by freq using numeric.OrderDescending;
>     generate
>         statistics.CUMULATIVE_COLUMN(Ordered, 1) as   -- Pig starts with 0th column, this refers to the column freq by offset
>                 ( query, freq, freq_cumulative_sum, freq_row, freq_rank, freq_dense_rank );
> };
> D = foreach C generate FLATTEN(A);

