[ https://issues.apache.org/jira/browse/PIG-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630490#comment-13630490 ]
Mythili Gopalakrishnan commented on PIG-821: -------------------------------------------- Hi Is it possible to do RANK over a GROUP? For example in the following dataset, I would like to rank by Income and then rank within Industry by Income.. Name Industry Income John Banking 20,000 Jane Banking 35,000 Chen Real Estate 30,000 Hari Real Estate 22,000 Asha Technology 26,000 I tried the following script listed below but get ERROR unexpected SYMBOL at or near RANK. ..... names_by_ind = group names by industry; rank_by_ind = foreach names_by_ind { results = RANK names BY income DESC; GENERATE flatten(results); } > simulate NTILE(n) , rank() functionality in pig > ----------------------------------------------- > > Key: PIG-821 > URL: https://issues.apache.org/jira/browse/PIG-821 > Project: Pig > Issue Type: New Feature > Components: impl > Affects Versions: 0.2.0 > Environment: mithril gold -gateway 4000 > Reporter: Rekha > Fix For: 0.11 > > > Hi, > I came across a job which has some processing which I cant seem to get easily > over-the-counter from pig. > These are NTILE() /rank() operations available in oracle. > While I am trying to write a UDF, that is not working out too well for me > yet.. :( > I have a ntile(n) over (partititon by x, y, z order by a desc, b desc) > operation to be done in pig scripts. > Is there a default function in pig scripting which can do this? > For example, lets consider a simple example at > http://download.oracle.com/docs/cd/B14117_01/server.101/b10759/functions091.htm > So here, how would we ideally substitute NTILE() with? any pig counterpart > function/udf? > SELECT last_name, salary, NTILE(4) OVER (ORDER BY salary DESC) > AS quartile FROM employees > WHERE department_id = 100; > > LAST_NAME SALARY QUARTILE > ------------------------- ---------- ---------- > Greenberg 12000 1 > Faviet 9000 1 > Chen 8200 2 > Urman 7800 2 > Sciarra 7700 3 > Popp 6900 4 > > In real case, i have ntile over multiple columns, so ideal way to find > histograms/boundary/spitting out the bucket number is needed. > Similarly a pig function is required for rank() over(partition by a,b,c order > by d desc) as e > Please let me know soon. > Thanks & Regards, > /Rekha -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira