I must be missing some tricky detail... Which of these operations could not be done by clever UDFs?
On Aug 9, 2012, at 9:01 AM, Gianmarco De Francisci Morales <g...@apache.org> wrote:

> Hi Allan,
> I think I found an answer to your problem:
>
> 1) Modify PhysicalPlanResetter by adding:
>
>     @Override
>     public void visitCounter(POCounter counter) throws VisitorException {
>         counter.reset();
>     }
>
> 2) Modify POCounter by adding:
>
>     @Override
>     public void reset() {
>         localCount = 0L;
>         taskID = "-1";
>         incrementer = 1;
>     }
>
> I get this result on this file + script:
>
> --------------------------------
> | a | id:int | value:chararray |
> --------------------------------
> |   | 6      | g               |
> --------------------------------
> ----------------------------------------------
> | b | rank_a:long | id:int | value:chararray |
> ----------------------------------------------
> |   | 1           | 6      | g               |
> ----------------------------------------------
>
> grunt> cat file.txt
> 1 a
> 2 b
> 3 c
> 3 d
> 4 e
> 6 f
> 6 g
> 8 h
>
> grunt> a = load 'file.txt' as (id:int, value:chararray);
> grunt> b = rank a;
> grunt> illustrate b
>
> Hope it helps.
>
> Cheers,
> --
> Gianmarco
>
> On Tue, Aug 7, 2012 at 4:34 PM, Allan <aaven...@gmail.com> wrote:
>
>> Hi to everybody!
>>
>> I'm working on the implementation of the rank operator, which has
>> successfully passed all the e2e tests on a cluster.
>> The rank operator is composed of two physical operators, POCounter and
>> PORank, and it provides two functionalities:
>>
>> 1) The first functionality is similar to ROW_NUMBER in SQL: it assigns
>> a sequential number to each tuple.
>> It is implemented by two map-only jobs (one for each physical
>> operator).
>>
>> - POCounter adds to each tuple the identifier of the task processing it
>> and a local counter. POCounter also registers the total number of tuples
>> processed by each task, through the use of global counters.
>> After POCounter finishes, the cumulative sum is calculated: the sum of
>> the tuples processed by all previous tasks. For task0 the cumulative sum
>> is 0 (there are no tuples before it), for task1 it is the number of
>> tuples processed by task0 (the only task before it), and so on.
>>
>> - Finally, PORank reads the cumulative sum corresponding to the task id
>> of each tuple and adds the tuple's local counter to it.
>>
>> An input example for POCounter could be:
>>
>> (1,n,5)
>> (8,a,0)
>> (0,b,9)
>>
>> Result of POCounter, and input to PORank:
>>
>> (0,1,1,n,5)
>> (0,2,8,a,0)
>> (0,3,0,b,9)
>>
>> And the result after PORank processing:
>>
>> (1,1,n,5)
>> (2,8,a,0)
>> (3,0,b,9)
>>
>>
>> 2) The second functionality is RANK BY, which is based on a set of
>> ordered columns, and it requires a different approach:
>> First, the dataset is grouped by the desired columns. Then the result is
>> sorted by the specified columns. Finally, the result is processed by
>> POCounter and PORank.
>> As in the previous case, POCounter adds the task identifier and the
>> local counter to each tuple. Here, however, the local counter is not
>> incremented sequentially. Instead, it is increased by the number of
>> tuples in the bag (produced by the previous "group by").
>> Likewise, the global counter is incremented by the size of the bag in
>> each tuple.
>>
>> Finally, PORank works the same as in the previous case, without changes.
>> After that, the rank column is spread to each element of the bag by a
>> foreach operator.
>>
>> An input example for POCounter (after sorting and grouping):
>> In this case, I would like to rank by the first column.
>>
>> (0,{(0,b,9)})
>> (1,{(1,n,5)})
>> (8,{(8,a,0)})
>>
>> After being processed by POCounter, the input to PORank is:
>>
>> (0,1,0,{(0,b,9)})
>> (0,2,1,{(1,n,5)})
>> (0,3,8,{(8,a,0)})
>>
>> Then, the result after PORank:
>>
>> (1,0,{(0,b,9)})
>> (2,1,{(1,n,5)})
>> (3,8,{(8,a,0)})
>>
>> Finally, the rank value is spread to each element of the bag by a
>> foreach operator, resulting in:
>>
>> (1,0,b,9)
>> (2,1,n,5)
>> (3,8,a,0)
>>
>> After testing some options, I found a way to illustrate the rank
>> operator, but I have some problems:
>>
>> 1.- I guess that, due to the illustrator algorithm, the tuples produced
>> by POCounter carry counter values two or three times higher than
>> expected, for example:
>> (0,38,1,n,5)
>> (0,39,8,a,0)
>> (0,40,0,b,9)
>>
>> 2.- So far, I get only 1 example tuple from illustrate. How could I get
>> at least three or four tuples as a result?
>>
>> Thanks in advance for your replies,
>>
>> --
>>
>> Allan Avendaño S.
>> Computer Engineer
>> Ex-SWY22 Participant
>> Rome - Italy
>> Gmail: aaven...@gmail.com
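
For anyone following the row-number scheme in the quoted message, here is a minimal stand-alone sketch of the arithmetic it describes: per-task tuple counts are turned into cumulative offsets, and each tuple's final rank is its task's offset plus its (1-based) local counter. This is plain Java, not Pig's actual POCounter/PORank code; the class and method names (RankSketch, cumulativeOffsets, globalRank) and the sample counts are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical sketch only: none of these names exist in Pig.
class RankSketch {

    // Exclusive prefix sum over the per-task tuple counts:
    // offsets[i] = number of tuples processed by tasks 0..i-1,
    // so task 0 always gets offset 0.
    static long[] cumulativeOffsets(long[] tuplesPerTask) {
        long[] offsets = new long[tuplesPerTask.length];
        long running = 0L;
        for (int i = 0; i < tuplesPerTask.length; i++) {
            offsets[i] = running;
            running += tuplesPerTask[i];
        }
        return offsets;
    }

    // Final step: the global rank is the offset of the tuple's task
    // plus the tuple's local counter (assumed to start at 1).
    static long globalRank(long[] offsets, int taskId, long localCount) {
        return offsets[taskId] + localCount;
    }

    public static void main(String[] args) {
        // Assumed sample: three tasks that processed 3, 2 and 4 tuples.
        long[] offsets = cumulativeOffsets(new long[] {3, 2, 4});
        System.out.println(Arrays.toString(offsets));   // [0, 3, 5]
        System.out.println(globalRank(offsets, 1, 2));  // 5
    }
}
```

With a single task, as in the quoted example, the offset is 0 and the rank equals the local counter, which is why a counter that is not reset between illustrate passes shows up directly as an inflated rank.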