alamb opened a new issue, #6906:
URL: https://github.com/apache/arrow-datafusion/issues/6906

   ### Is your feature request related to a problem or challenge?
   
   https://github.com/apache/arrow-datafusion/pull/6904 introduces some fancy 
new hashing and ways to implement aggregates
   
   min/max for strings (`StringArray` / `LargeStringArray`, etc) now uses the 
slower `Accumulator` implementation which could be made much faster
   
   ### Describe the solution you'd like
   
   I would like to implement a fast `GroupsAccumulator` for Min/Max
   
   
   
   ### Describe alternatives you've considered
   
   here is one potential way to implement it: 
   
   We could store the current minimum for all groups in the same Rows πŸ€” and 
track an index into that Rows for the current minimum for each group.
   
   This would require an extra copy of the input values, but it could probably 
be vectorized pretty well, as shown in the following diagram. 
   
   Sorry what I meant was something like the following where the accumulator 
only stored the current minimum values.
   
   This approach would potentially end up with `min_storage` being full of 
"garbage" if many batches had new minumums, but I think we could heuristically 
"compact" `min_storage` (if it had `2*num_groups`, for example) if it got too 
large
   
   ```
                                       β”Œ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐  
                                                                              
                                       β”‚           Accumulator             β”‚  
                                                   state                      
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚  
   β”‚ β”Œβ”€β”€β”€β”€β”€β” β”‚       β”‚ β”Œβ”€β”€β”€β”€β”€β” β”‚         β”‚ β”Œβ”€β”€β”€β”€β”€β” β”‚          β”‚ β”Œβ”€β”€β”€β”€β”€β” β”‚     
   β”‚ β”‚  A  β”‚ β”‚       β”‚ β”‚  A  │─┼────┐  β”‚ β”‚ β”‚  D  β”‚ β”‚    β”Œβ”€β”€β”€β”€β”€β”Όβ”€β”‚  1  β”‚ β”‚  β”‚  
   β”‚ β”œβ”€β”€β”€β”€β”€β”€ β”‚       β”‚ β”œβ”€β”€β”€β”€β”€β”€ β”‚    β”‚    β”‚ β”œβ”€β”€β”€β”€β”€β”€ β”‚    β”‚     β”‚ β”œβ”€β”€β”€β”€β”€β”€ β”‚     
   β”‚ β”‚  B  β”‚ β”‚       β”‚ β”‚  B  β”‚ β”‚    └──┼─┼▢│  A  β”‚β—€β”Όβ”€β”€β”€β”€β”˜     β”‚ β”‚  0  β”‚ β”‚  β”‚  
   β”‚ β”œβ”€β”€β”€β”€β”€β”€ β”‚       β”‚ β”œβ”€β”€β”€β”€β”€β”€ β”‚         β”‚ β””β”€β”€β”€β”€β”€β”˜ β”‚          β”‚ β””β”€β”€β”€β”€β”€β”˜ β”‚     
   β”‚ β”‚  A  β”‚ β”‚       β”‚ β”‚  A  β”‚ β”‚       β”‚ β”‚         β”‚          β”‚         β”‚  β”‚  
   β”‚ β”œβ”€β”€β”€β”€β”€β”€ β”‚       β”‚ β”œβ”€β”€β”€β”€β”€β”€ β”‚         β”‚         β”‚          β”‚         β”‚     
   β”‚ β”‚  A  β”‚ β”‚       β”‚ β”‚  A  β”‚ β”‚       β”‚ β”‚         β”‚          β”‚         β”‚  β”‚  
   β”‚ β”œβ”€β”€β”€β”€β”€β”€ β”‚       β”‚ β”œβ”€β”€β”€β”€β”€β”€ β”‚         β”‚         β”‚          β”‚         β”‚     
   β”‚ β”‚  C  β”‚ β”‚       β”‚ β”‚  C  β”‚ β”‚       β”‚ β”‚         β”‚          β”‚         β”‚  β”‚  
   β”‚ β””β”€β”€β”€β”€β”€β”˜ β”‚       β”‚ β””β”€β”€β”€β”€β”€β”˜ β”‚         β”‚         β”‚          β”‚         β”‚     
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  
                                                                              
     input              input          β”‚ min_storage:         min_values   β”‚  
     values             values           Rows                                 
     (Array)            (Rows)         β”” ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ β”˜  
             step 1:           step 2: for                                    
             convert           any value                   step 3: min value  
             arguments to      that is a new               (per group) is     
             Row format        group                       tracked as an      
                               minimum, copy               index into         
                               it to a                     min_storage `Rows` 
                               second `Rows`                                  
                                                                              
   ```
   
   See 
https://github.com/apache/arrow-datafusion/pull/6800#issuecomment-1622290981 
for more details
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to