Ben Rahamim created DATAFU-176:
----------------------------------

             Summary: Add a way to do dedupTopN with combiner
                 Key: DATAFU-176
                 URL: https://issues.apache.org/jira/browse/DATAFU-176
             Project: DataFu
          Issue Type: New Feature
    Affects Versions: 2.1.0
            Reporter: Ben Rahamim


In many of our solutions, we select only a fixed number of rows, usually a
small number, based on ordering by a column. DataFu has dedupTopN, which uses a
window function, and dedupWithCombiner, which is limited to taking one record
per grouping. The window function in dedupTopN is inefficient because it orders
all of the rows in each group, and it is very susceptible to skew. A better
solution would be a class, like dedupWithCombiner, that allows selecting many
rows. One possible solution is a class that implements DeclarativeAggregate,
which avoids having to declare the schemas explicitly, uses the combiner to
avoid skew, and also benefits from Codegen.
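
To illustrate why a combiner helps here, below is a minimal sketch in plain
Python. It is not the DataFu or Spark API and not the code in the planned PR.
All function names (update, partial_top_n, merge, finish) are hypothetical.
Each partition keeps at most n rows per group in a small heap, and the partial
results are merged afterwards, so no group is ever fully sorted in one place.

```python
import heapq
from itertools import count

def update(buffer, row, n, order_key, tie):
    """Add one row to a group's min-heap, keeping only the n largest by order_key."""
    entry = (order_key(row), next(tie), row)  # tie counter avoids comparing rows
    if len(buffer) < n:
        heapq.heappush(buffer, entry)
    elif entry[0] > buffer[0][0]:
        heapq.heapreplace(buffer, entry)  # drop the current smallest entry

def partial_top_n(partition, n, group_key, order_key):
    """Combiner step: build a bounded buffer per group for a single partition."""
    tie = count()
    buffers = {}
    for row in partition:
        update(buffers.setdefault(group_key(row), []), row, n, order_key, tie)
    return buffers

def merge(left, right, n):
    """Merge step: combine two partial results, again keeping n entries per group."""
    merged = {}
    for group in left.keys() | right.keys():
        entries = left.get(group, []) + right.get(group, [])
        merged[group] = heapq.nlargest(n, entries, key=lambda e: e[0])
    return merged

def finish(buffers):
    """Final step: return the rows of each group, highest order_key first."""
    return {
        group: [row for _, _, row in sorted(entries, key=lambda e: e[0], reverse=True)]
        for group, entries in buffers.items()
    }

# Hypothetical data: two partitions of (group, value) rows, top 2 per group.
p1 = [("a", 5), ("a", 1), ("b", 7), ("a", 9)]
p2 = [("a", 8), ("b", 2), ("a", 3)]
by_group = lambda row: row[0]
by_value = lambda row: row[1]
result = finish(merge(partial_top_n(p1, 2, by_group, by_value),
                      partial_top_n(p2, 2, by_group, by_value), 2))
```

Because each buffer holds at most n entries, the data sent between the partial
and merge steps is bounded by n rows per group per partition, which is what
limits the effect of a skewed group.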

 

I have prepared code that does this and will submit it as a PR.



