I must be missing some tricky detail... Which of these operations could not be 
done by clever UDFs?

On Aug 9, 2012, at 9:01 AM, Gianmarco De Francisci Morales <g...@apache.org> 
wrote:

> Hi Allan,
> I think I found an answer to your problem:
> 
> 1) Modify PhysicalPlanResetter by adding:
> 
>    @Override
>    public void visitCounter(POCounter counter) throws VisitorException {
>        counter.reset();
>    }
> 
> 
> 2) Modify POCounter by adding:
> 
>    @Override
>    public void reset() {
>        localCount = 0L;
>        taskID = "-1";
>        incrementer = 1;
>    }
> 
> 
> 
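The patch above suggests that counter state survives from one pass to the next. A minimal sketch of that effect follows, assuming (this is not confirmed in the thread) that illustrate drives the same operator instance several times. CounterResetSketch, next() and runPass() are hypothetical names; only the three fields and the body of reset() come from the patch.

```java
import java.util.Arrays;

// Hypothetical stand-in for the counter state, not Pig's POCounter.
class CounterResetSketch {
    private long localCount = 0L;
    private String taskID = "-1";   // kept only to mirror the patch
    private long incrementer = 1;

    // Assumed behaviour: advance the counter and return the value for the next tuple.
    long next() {
        localCount += incrementer;
        return localCount;
    }

    // Same body as the reset() in the patch.
    void reset() {
        localCount = 0L;
        taskID = "-1";
        incrementer = 1;
    }

    // Simulates one pass over n tuples and returns the counter values handed out.
    static long[] runPass(CounterResetSketch counter, int n) {
        long[] values = new long[n];
        for (int i = 0; i < n; i++) {
            values[i] = counter.next();
        }
        return values;
    }

    public static void main(String[] args) {
        CounterResetSketch counter = new CounterResetSketch();
        System.out.println(Arrays.toString(runPass(counter, 3))); // [1, 2, 3]
        System.out.println(Arrays.toString(runPass(counter, 3))); // [4, 5, 6] (stale state)
        counter.reset();
        System.out.println(Arrays.toString(runPass(counter, 3))); // [1, 2, 3]
    }
}
```

Without the reset, each extra pass keeps counting from where the previous one stopped, which would be consistent with the inflated values (38, 39, 40) reported further down the thread.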
> With both changes applied, I get this result for the file and script below:
> 
> ------------------------------------------
> | a     | id:int    | value:chararray    |
> ------------------------------------------
> |       | 6         | g                  |
> ------------------------------------------
> -----------------------------------------------------------
> | b     | rank_a:long    | id:int    | value:chararray    |
> -----------------------------------------------------------
> |       | 1              | 6         | g                  |
> -----------------------------------------------------------
> 
> 
> 
> grunt> cat file.txt
> 1 a
> 2 b
> 3 c
> 3 d
> 4 e
> 6 f
> 6 g
> 8 h
> 
> 
> 
> 
> grunt> a = load 'file.txt' as (id:int, value:chararray);
> grunt> b = rank a;
> grunt> illustrate b
> 
> 
> 
> 
> Hope it helps.
> 
> 
> Cheers,
> 
> --
> Gianmarco
> 
> 
> 
> On Tue, Aug 7, 2012 at 4:34 PM, Allan <aaven...@gmail.com> wrote:
> 
>> Hi everybody!
>> 
>> I'm working on the implementation of the rank operator, which has
>> successfully passed all the e2e tests on a cluster.
>> The rank operator is composed of two physical operators, POCounter and
>> PORank, and it provides two functionalities:
>> 
>> 1) The first functionality is similar to ROW NUMBER in SQL: it assigns a
>> sequential number to each tuple.
>> It is implemented as two map-only jobs (one for each physical
>> operator).
>> 
>> - POCounter adds to each tuple the identifier of the task processing it
>> and a local counter. In addition, POCounter records the total number of
>> tuples processed by each task through global counters.
>> Once POCounter has finished, a cumulative sum is computed for each task:
>> the total number of tuples processed by all previous tasks. For task0 the
>> cumulative sum is 0 (there are no tuples before it), for task1 it is the
>> number of tuples processed by task0 (the only task before it), and so on.
>> 
>> - Finally, PORank looks up the cumulative sum for the task id of each
>> tuple and adds it to the tuple's local counter.
>> 
>> An example input for POCounter could be:
>> 
>> (1,n,5)
>> (8,a,0)
>> (0,b,9)
>> 
>> The result of POCounter, which is the input to PORank:
>> 
>> (0,1,1,n,5)
>> (0,2,8,a,0)
>> (0,3,0,b,9)
>> 
>> and the result after PORank:
>> 
>> (1,1,n,5)
>> (2,8,a,0)
>> (3,0,b,9)
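
The ROW NUMBER steps described above can be sketched as follows. This is a minimal illustration, not Pig's actual POCounter/PORank code: RowNumberSketch, cumulativeOffsets() and rank() are hypothetical names, and the per-task tuple counts are made up.

```java
import java.util.Arrays;

// Hypothetical sketch of the two-job ROW NUMBER scheme.
class RowNumberSketch {

    // Between the two jobs: offsets[t] is the number of tuples
    // processed by tasks 0..t-1 (the "cumulative sum" in the mail).
    static long[] cumulativeOffsets(long[] tuplesPerTask) {
        long[] offsets = new long[tuplesPerTask.length];
        long running = 0L;
        for (int t = 0; t < tuplesPerTask.length; t++) {
            offsets[t] = running;
            running += tuplesPerTask[t];
        }
        return offsets;
    }

    // PORank step: cumulative sum of the tuple's task plus its local counter.
    static long rank(long[] offsets, int taskId, long localCounter) {
        return offsets[taskId] + localCounter;
    }

    public static void main(String[] args) {
        long[] tuplesPerTask = {3, 2, 4};              // made-up global counter values
        long[] offsets = cumulativeOffsets(tuplesPerTask);
        System.out.println(Arrays.toString(offsets));  // [0, 3, 5]
        System.out.println(rank(offsets, 0, 1));       // 1: first tuple of task0
        System.out.println(rank(offsets, 1, 2));       // 5: second tuple of task1
        System.out.println(rank(offsets, 2, 4));       // 9: last tuple of task2
    }
}
```

Local counters start at 1 here, as in the example tuples above, so the last tuple of the last task gets a rank equal to the total tuple count.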
>> 
>> 
>> 2) The second functionality is RANK BY, which ranks on a set of ordered
>> columns.
>> It requires a different approach:
>> First, the dataset is grouped by the desired columns. Then the result is
>> sorted by the specified columns. Finally, the sorted result is processed
>> by POCounter and PORank.
>> As in the previous case, POCounter adds the task identifier and the local
>> counter to each tuple. Here, however, the local counter is not incremented
>> sequentially. Instead, it is incremented by the number of tuples in the
>> bag (produced by the preceding "group by").
>> The global counter is likewise incremented by the size of the bag in each
>> tuple.
>> 
>> Finally, PORank works exactly as in the previous case. After that, the
>> rank column is spread to each element of the bag by a foreach operator.
>> 
>> An example input for POCounter (after grouping and sorting):
>> In this case, I would like to rank by the first column.
>> 
>> (0,{(0,b,9)})
>> (1,{(1,n,5)})
>> (8,{(8,a,0)})
>> 
>> After being processed by POCounter, which is also the input to
>> PORank:
>> 
>> (0,1,0,{(0,b,9)})
>> (0,2,1,{(1,n,5)})
>> (0,3,8,{(8,a,0)})
>> 
>> Then, the result after PORank:
>> 
>> (1,0,{(0,b,9)})
>> (2,1,{(1,n,5)})
>> (3,8,{(8,a,0)})
>> 
>> Finally, the rank value is spread to each element of the bag by a
>> foreach operator, resulting in:
>> 
>> (1,0,b,9)
>> (2,1,n,5)
>> (3,8,a,0)
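
The example above has one tuple per bag, so it does not show what happens with ties. Below is a minimal sketch of the RANK BY counting for larger bags. It assumes that a group receives the current counter value before the counter advances by the bag size; the mail does not state that order. RankBySketch, spreadRanks() and taskOffset are hypothetical names, not Pig code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of RANK BY counting within one task.
class RankBySketch {

    // bagSizes: sizes of the bags of the grouped and sorted tuples, in order.
    // taskOffset: cumulative sum of tuples seen by previous tasks.
    // Returns one rank per original tuple, after spreading the group's
    // rank to every element of its bag.
    static List<Long> spreadRanks(int[] bagSizes, long taskOffset) {
        List<Long> ranks = new ArrayList<>();
        long nextRank = 1L;                      // assumed starting value
        for (int size : bagSizes) {
            long groupRank = taskOffset + nextRank;
            for (int i = 0; i < size; i++) {
                ranks.add(groupRank);            // every tuple in the bag shares the rank
            }
            nextRank += size;                    // advance by bag size, not by one
        }
        return ranks;
    }

    public static void main(String[] args) {
        // Three groups holding 2, 1 and 3 tuples, processed by the first task.
        System.out.println(spreadRanks(new int[]{2, 1, 3}, 0L)); // [1, 1, 3, 4, 4, 4]
        // Two groups in a later task that follows 6 earlier tuples.
        System.out.println(spreadRanks(new int[]{1, 2}, 6L));    // [7, 8, 8]
    }
}
```

Under this assumption, tied tuples share a rank and the following group skips ahead by the size of the tie.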
>> 
>> After testing some options, I found a way to illustrate the rank operator,
>> but I have some problems:
>> 
>> 1.- I guess that, due to the illustrator algorithm, the tuples produced by
>> POCounter carry counter values two or three times higher than
>> expected, for example:
>> (0,38,1,n,5)
>> (0,39,8,a,0)
>> (0,40,0,b,9)
>> 
>> 2.- So far, illustrate gives me only one example tuple. How could I get
>> at least three or four tuples in the result?
>> 
>> Thanks in advance for your replies,
>> 
>> --
>> 
>> Allan Avendaño S.
>> Computer Engineer
>> Ex-SWY22 Participant
>> Rome - Italy
>> Gmail: aaven...@gmail.com
>> --
>> 
>> 
