[jira] Commented: (PIG-821) simulate NTILE(n) , rank() functionality in pig

David Ciemiewicz (JIRA) Sat, 19 Feb 2011 16:42:01 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12996938#comment-12996938
 ]


David Ciemiewicz commented on PIG-821:
--------------------------------------

I tried doing an NTILE and generic UDF for computing cummulative statistics 
such as decile, rank, denserank, etc for working on query streams.  With a 
billion or more queries, the serialization of sequentially processing every 
single row was a bit slow.

What I realized was that the frequency histogram for something like a 
"powerlaw" distribution of a billion queries would be pretty compact (something 
like only 20K unique frequencies).  Computing sequential commulative statistics 
on 20K, 100K, or even 1M rows is far more efficient than doing it for >1B rows.

So I used map reduce to compute the frequency histogram, compute the deciles on 
the frequency histogram, then map them back to the individual queries which 
then exploits the parallelism of map-reduce.

With the introduction of Jython scripting for UDFs in Pig 0.8, it makes it 
pretty simple to distribute the example.  Furthermore, once inline Jython 
scripts are implemented (as requested in another JIRA, this will be even more 
convenient.

Here's the Pig:
{code}

register 'ntiles.py' using jython as ntiles;

Queries = load 'queries.txt' using PigStorage() as ( query: chararray, freq: 
long );

FreqGroup = group Queries by ( freq );

FreqHisto = foreach FreqGroup generate
        group as freq,
        COUNT(Queries) as count;

FreqHistoAll = group FreqHisto all;

CummStats = foreach FreqHistoAll {
        FreqHistoOrdered = order FreqHisto by freq desc;
        generate
        FLATTEN(ntiles.freqhisto_cummstats(FreqHistoOrdered, 10));
        };

QueryDeciles = join Queries by freq, CummStats by freq;

QueryDeciles = order QueryDeciles by freq desc, query;

store QueryDeciles into 'query-deciles.txt' using PigStorage();
{code}

Here's the Jython function for computing the cumulative stats on the frequency 
historgram:
{code}

@outputSchema("freqhisto_cummstats:bag{stats:tuple(freq:long,count:long,total:long,proportion:double,cummcount:long,cummproportion:double,ntile:long,rank:long,denserank:long)}")
def freqhisto_cummstats(freqhisto, ntiles):
  outBag = []
  totalcount = 0
  for freqcount in freqhisto:
    freq = freqcount[0]
    count = freqcount[1]
    totalcount += freq * count
  cummcount = 0
  ntile = 1
  rank = 1
  denserank = 1
  for freqcount in freqhisto:
    freq = freqcount[0]
    count = freqcount[1]
    cummcount += freq * count
    proportion = float(freq) * count / totalcount
    cummproportion = float(cummcount) / totalcount
    if cummproportion >= float(ntile)/ntiles:
      ntile += 1
      if ntile > ntiles:
        ntile = ntiles
    tup=(freq, count, totalcount, proportion, cummcount, cummproportion, ntile, 
rank, denserank)
    outBag.append(tup)
    rank += 1
    denserank += freq * count
  return outBag
{code}

Here's a sample Perl program to generate "faux" query log data:
{code}

#! /usr/bin/perl

use Digest::MD5 qw(md5 md5_hex md5_base64);
$MAX = 1001;

foreach $i ( 1 .. $MAX ) {
    foreach $j ( 1 .. $i ) {
        $digest = md5_base64($i*$MAX+$j);
        print join("\t", $digest, $MAX-$i), "\n";
    }
}
{code}

> simulate NTILE(n) , rank() functionality in pig
> -----------------------------------------------
>
>                 Key: PIG-821
>                 URL: https://issues.apache.org/jira/browse/PIG-821
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.2.0
>         Environment: mithril gold -gateway 4000
>            Reporter: Rekha
>
> Hi,
> I came across a job which has some processing which I cant seem to get easily 
> over-the-counter from pig.
> These are NTILE() /rank() operations available in oracle.
> While I am trying to write a UDF, that is not working out too well for me 
> yet.. :(
> I have a ntile(n) over (partititon by x, y, z order by a desc, b desc) 
> operation to be done in pig scripts.
> Is there a default function in pig scripting which can do this?
> For example, lets consider a simple example at 
> http://download.oracle.com/docs/cd/B14117_01/server.101/b10759/functions091.htm
> So here, how would we ideally substitute NTILE() with? any pig counterpart 
> function/udf?
> SELECT last_name, salary, NTILE(4) OVER (ORDER BY salary DESC) 
>    AS quartile FROM employees
>    WHERE department_id = 100;
>  
> LAST_NAME                     SALARY   QUARTILE
> ------------------------- ---------- ----------
> Greenberg                      12000          1
> Faviet                          9000          1
> Chen                            8200          2
> Urman                           7800          2
> Sciarra                         7700          3
> Popp                            6900          4
>  
> In real case, i have ntile over multiple columns, so ideal way to find 
> histograms/boundary/spitting out the bucket number is needed.
> Similarly a pig function is required for rank() over(partition by a,b,c order 
> by d desc) as e
> Please let me know soon.
> Thanks & Regards,
> /Rekha

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (PIG-821) simulate NTILE(n) , rank() functionality in pig

Reply via email to