alamb edited a comment on issue #1486:
URL: 
https://github.com/apache/arrow-datafusion/issues/1486#issuecomment-1016877360


   (we should probably bring this discussion into a new ticket, FWIW)
   
   TLDR is I think `median` is quite complicated to get both correct and high 
performance. 
   
   The key problem is that you basically need to have buffered the entire input 
stream before you know what the output is
   
   Starting with an approximation of median would be fine and I suspect other 
users of DataFusion would find it valuable. 
   
   To implement an exact median, you could implement an `Aggregator` like we 
have for other aggregates that buffers up all the input, sorts it,  and 
produces the median value. It would have terrible memory consumption but is 
probably reasonably straightforward to implement using existing compute kernels 
as @matthewmturner suggests
   
   If you wanted to try and reuse existing operators (e.g. `LogicalPlan` and 
`ExecutionPlan`), I think that would be tough. 
   
   The mapping of `SQL` --> LogicalPlan is formulaic but non trivial. All 
aggregate functions are treated the same (create the same `LogicalPlan` nodes, 
with different expr lists, so if we special case `median` with a special  
operator we'll have to sort out how to handle queries like `SELECT sum(foo), 
median(foo) from bar GROUP BY baz` which I think will be complicated
   
   Another  challenge with median is that there is no way to partially 
aggregate intermediate results. DataFusion currently makes this pattern (which 
is useful when calculating sums, for example, because you can calculate partial 
sums and then sum them together)
   
   ```
                           ┌──────────┐                     
                           │          │                     
                           │Aggregator│                     
                           │ (Final)  │                     
                           │          │                     
                           │          │                     
                           └──────────┘                     
                                 ▲                          
                                 │                          
         ┌────────────────┬──────┴────────────────────┐     
         │                │                           │     
         │                │                           │     
         │                │                           │     
   ┌──────────┐     ┌──────────┐                ┌──────────┐
   │          │     │          │                │          │
   │Aggregator│     │Aggregator│                │Aggregator│
   │(Partial) │     │(Partial) │         ...    │(Partial) │
   │          │     │          │                │          │
   │          │     │          │                │          │
   └──────────┘     └──────────┘                └──────────┘
         ▲                ▲                           ▲     
         │                │                           │     
         └────────────────┴──────┬────────────────────┘     
                                 │                          
                                 │                          
                           ┌───────────┐                    
                           │           │                    
                           │Repartition│                    
                           │           │                    
                           │           │                    
                           └───────────┘                    
                                 ▲                          
                                 │                          
                                 │                          
                           ┌───────────┐                    
                           │Input      │                    
                           │           │                    
                           └───────────┘                    
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to