Ben Rahamim created DATAFU-176:
----------------------------------

             Summary: Add a way to do dedupTopN with combiner
                 Key: DATAFU-176
                 URL: https://issues.apache.org/jira/browse/DATAFU-176
             Project: DataFu
          Issue Type: New Feature
    Affects Versions: 2.1.0
            Reporter: Ben Rahamim

In many of our solutions, we select only a fixed, usually small, number of rows per group, ordered by a column. DataFu has dedupTopN, which uses a window function, and dedupWithCombiner, which is limited to taking one record per group. dedupTopN is inefficient because the window function sorts all of the rows in each group, and it is very susceptible to skew. dedupWithCombiner won't let us take more than one row.

A better solution would be a class, like dedupWithCombiner, that allows selecting many rows. One possible approach is a class that implements DeclarativeAggregate, which avoids having to declare the schemas explicitly, uses the combiner to avoid skew, and also benefits from codegen.

I have prepared code that does this and will submit it as a PR.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
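
To illustrate why a combiner helps here, the following is a minimal sketch in plain Python, not Spark and not the code from the planned PR. The function names (partial_top_n, merge_top_n) and the (key, order_value, payload) row layout are hypothetical, chosen only for this example. Each partition keeps at most n rows per key in a bounded heap, so no group is ever fully sorted and a skewed key sends at most n rows per partition to the merge step.

```python
import heapq
from collections import defaultdict


def partial_top_n(rows, n):
    """Combiner step for one partition: keep at most n rows per key.

    rows is an iterable of (key, order_value, payload) tuples (assumed layout).
    A min-heap of size n per key holds the n largest rows seen so far.
    """
    buffers = defaultdict(list)
    for key, order_value, payload in rows:
        heap = buffers[key]
        item = (order_value, payload)
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # The new row beats the smallest kept row, so swap it in.
            heapq.heapreplace(heap, item)
    return dict(buffers)


def merge_top_n(left, right, n):
    """Merge two partial results, again keeping at most n rows per key."""
    merged = {}
    for key in left.keys() | right.keys():
        candidates = left.get(key, []) + right.get(key, [])
        # nlargest returns the rows sorted in descending order.
        merged[key] = heapq.nlargest(n, candidates)
    return merged


# Two partitions of made-up data; key "a" appears in both.
partition_1 = [("a", 5, "r1"), ("a", 9, "r2"), ("a", 1, "r3"), ("b", 4, "r4")]
partition_2 = [("a", 7, "r5"), ("b", 8, "r6"), ("b", 2, "r7")]

result = merge_top_n(partial_top_n(partition_1, 2), partial_top_n(partition_2, 2), 2)
# result["a"] is [(9, "r2"), (7, "r5")] and result["b"] is [(8, "r6"), (4, "r4")]
```

A window-function approach would instead bring every row of a key to one place and sort them all before keeping the first n, which is where the cost and the sensitivity to skew come from.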