andygrove commented on issue #2925: URL: https://github.com/apache/arrow-datafusion/issues/2925#issuecomment-1186250273
So this is a bit more involved than I had hoped. Per the discussion in https://github.com/apache/arrow-datafusion/issues/1486#issuecomment-1016877360 TLDR is I think `median` is quite complicated to get both correct and high performance. The key problem is that you basically need to have buffered the entire input stream before you know what the output is Starting with an approximation of median would be fine and I suspect other users of DataFusion would find it valuable. To implement an exact median, you could implement an `Aggregator` like we have for other aggregates that buffers up all the input, sorts it, and produces the median value. It would have terrible memory consumption but is probably reasonably straightforward to implement using existing compute kernels as @matthewmturner suggests If you wanted to try and reuse existing operators (e.g. `LogicalPlan` and `ExecutionPlan`), I think that would be tough. The mapping of `SQL` --> LogicalPlan is formulaic but non trivial. All aggregate functions are treated the same (create the same `LogicalPlan` nodes, with different expr lists, so if we special case `median` with a special operator we'll have to sort out how to handle queries like `SELECT sum(foo), median(foo) from bar GROUP BY baz` which I think will be complicated Another challenge with median is that there is no way to partially aggregate intermediate results. DataFusion currently makes this pattern (which is useful when calculating sums, for example, because you can calculate partial sums and then sum them together) ``` ┌──────────┐ │ │ │Aggregator│ │ (Final) │ │ │ │ │ └──────────┘ ▲ │ ┌────────────────┬──────┴────────────────────┐ │ │ │ │ │ │ │ │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │ │ │ │ │Aggregator│ │Aggregator│ │Aggregator│ │(Partial) │ │(Partial) │ ... │(Partial) │ │ │ │ │ │ │ │ │ │ │ │ │ └──────────┘ └──────────┘ └──────────┘ ▲ ▲ ▲ │ │ │ └────────────────┴──────┬────────────────────┘ │ │ ┌───────────┐ │ │ │Repartition│ │ │ │ │ └───────────┘ ▲ │ │ ┌───────────┐ │Input │ │ │ └───────────┘ ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
