[
https://issues.apache.org/jira/browse/PIG-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12996938#comment-12996938
]
David Ciemiewicz commented on PIG-821:
--------------------------------------
I tried doing an NTILE and generic UDF for computing cummulative statistics
such as decile, rank, denserank, etc for working on query streams. With a
billion or more queries, the serialization of sequentially processing every
single row was a bit slow.
What I realized was that the frequency histogram for something like a
"powerlaw" distribution of a billion queries would be pretty compact (something
like only 20K unique frequencies). Computing sequential commulative statistics
on 20K, 100K, or even 1M rows is far more efficient than doing it for >1B rows.
So I used map reduce to compute the frequency histogram, compute the deciles on
the frequency histogram, then map them back to the individual queries which
then exploits the parallelism of map-reduce.
With the introduction of Jython scripting for UDFs in Pig 0.8, it makes it
pretty simple to distribute the example. Furthermore, once inline Jython
scripts are implemented (as requested in another JIRA, this will be even more
convenient.
Here's the Pig:
{code}
register 'ntiles.py' using jython as ntiles;
Queries = load 'queries.txt' using PigStorage() as ( query: chararray, freq:
long );
FreqGroup = group Queries by ( freq );
FreqHisto = foreach FreqGroup generate
group as freq,
COUNT(Queries) as count;
FreqHistoAll = group FreqHisto all;
CummStats = foreach FreqHistoAll {
FreqHistoOrdered = order FreqHisto by freq desc;
generate
FLATTEN(ntiles.freqhisto_cummstats(FreqHistoOrdered, 10));
};
QueryDeciles = join Queries by freq, CummStats by freq;
QueryDeciles = order QueryDeciles by freq desc, query;
store QueryDeciles into 'query-deciles.txt' using PigStorage();
{code}
Here's the Jython function for computing the cumulative stats on the frequency
historgram:
{code}
@outputSchema("freqhisto_cummstats:bag{stats:tuple(freq:long,count:long,total:long,proportion:double,cummcount:long,cummproportion:double,ntile:long,rank:long,denserank:long)}")
def freqhisto_cummstats(freqhisto, ntiles):
outBag = []
totalcount = 0
for freqcount in freqhisto:
freq = freqcount[0]
count = freqcount[1]
totalcount += freq * count
cummcount = 0
ntile = 1
rank = 1
denserank = 1
for freqcount in freqhisto:
freq = freqcount[0]
count = freqcount[1]
cummcount += freq * count
proportion = float(freq) * count / totalcount
cummproportion = float(cummcount) / totalcount
if cummproportion >= float(ntile)/ntiles:
ntile += 1
if ntile > ntiles:
ntile = ntiles
tup=(freq, count, totalcount, proportion, cummcount, cummproportion, ntile,
rank, denserank)
outBag.append(tup)
rank += 1
denserank += freq * count
return outBag
{code}
Here's a sample Perl program to generate "faux" query log data:
{code}
#! /usr/bin/perl
use Digest::MD5 qw(md5 md5_hex md5_base64);
$MAX = 1001;
foreach $i ( 1 .. $MAX ) {
foreach $j ( 1 .. $i ) {
$digest = md5_base64($i*$MAX+$j);
print join("\t", $digest, $MAX-$i), "\n";
}
}
{code}
> simulate NTILE(n) , rank() functionality in pig
> -----------------------------------------------
>
> Key: PIG-821
> URL: https://issues.apache.org/jira/browse/PIG-821
> Project: Pig
> Issue Type: New Feature
> Components: impl
> Affects Versions: 0.2.0
> Environment: mithril gold -gateway 4000
> Reporter: Rekha
>
> Hi,
> I came across a job which has some processing which I cant seem to get easily
> over-the-counter from pig.
> These are NTILE() /rank() operations available in oracle.
> While I am trying to write a UDF, that is not working out too well for me
> yet.. :(
> I have a ntile(n) over (partititon by x, y, z order by a desc, b desc)
> operation to be done in pig scripts.
> Is there a default function in pig scripting which can do this?
> For example, lets consider a simple example at
> http://download.oracle.com/docs/cd/B14117_01/server.101/b10759/functions091.htm
> So here, how would we ideally substitute NTILE() with? any pig counterpart
> function/udf?
> SELECT last_name, salary, NTILE(4) OVER (ORDER BY salary DESC)
> AS quartile FROM employees
> WHERE department_id = 100;
>
> LAST_NAME SALARY QUARTILE
> ------------------------- ---------- ----------
> Greenberg 12000 1
> Faviet 9000 1
> Chen 8200 2
> Urman 7800 2
> Sciarra 7700 3
> Popp 6900 4
>
> In real case, i have ntile over multiple columns, so ideal way to find
> histograms/boundary/spitting out the bucket number is needed.
> Similarly a pig function is required for rank() over(partition by a,b,c order
> by d desc) as e
> Please let me know soon.
> Thanks & Regards,
> /Rekha
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira