alamb commented on issue #7647:
URL: 
https://github.com/apache/arrow-datafusion/issues/7647#issuecomment-1879161073

   TLDR I recommend we revert this change and reopen this ticket while we 
reconsider how to handle this case better.
   
   # Background
   
   This ticket caused a functional regression for us downstream in 
https://github.com/apache/arrow-datafusion/issues/8738 (a query that used to 
run started erroring). 
   
   The cause of the issue is that the `LogicalPlan` and `ExecutionPlan`s 
schemas no longer match. I started exploring making them match in 
https://github.com/apache/arrow-datafusion/pull/8766
   
   However while working on that issue,  I thought more about this change and I 
am not sure the change described in this issue is good for multi column 
groupings at all.
   
   # Rationale
   
   Dictionary Encoding on single column group keys is bad  because there is no 
repetition in the data and therefore the Dictionary encoding is pure over head. 
For example, given                                                              
     
                                                                                
        
   ```sql
   SELECT ... GROUP BY a                                                        
        
   ```
   Dictionary Encoding is worse than native encoding:
   
   ```
   ┌──────┐                   ┌──────┐      ┌────┐     
   │ foo  │                   │ foo  │      │ 0  │     
   ├──────┤                   ├──────┤      ├────┤     
   │ bar  │                   │ bar  │      │ 1  │     
   ├──────┤                   ├──────┤      ├────┤     
   │ baz  │  ────────▶        │ baz  │      │ 2  │     
   ├──────┤                   ├──────┤      ├────┤     
   │  ff  │                   │  ff  │      │ 3  │     
   ├──────┤                   ├──────┤      ├────┤     
   │  ..  │                   │  ..  │      │ .. │     
   ├──────┤                   ├──────┤      ├────┤     
   │ aaz  │                   │ aaz  │      │9999│     
   └──────┘                   └──────┘      └────┘     
                                                       
    Group Values            values array   keys array  
      (distinct                                        
    values of a)                                       
   ```
   
   However, the story is different when there is a multi column group key, as 
in that case, dictionary encoding each column can be a significant performance 
improvement as they are applied to each column individually and each column may 
have substantial redundancy. For example, given this query
   
   ```sql
   SELECT ... GROUP BY a,b
   ```
   
   ```
   ┌──────┐  ┌──────┐               ┌──────┐      ┌────┐         ┌──────┐      
┌────┐      
   │ foo  │  │ tag1 │               │ foo  │      │ 0  │         │ tag1 │      
│ 0  │      
   ├──────┤  ├──────┤               ├──────┤      ├────┤         ├──────┤      
├────┤      
   │ foo  │  │ tag2 │               │ baz  │      │ 0  │         │ tag2 │      
│ 1  │      
   ├──────┤  ├──────┤               └──────┘      ├────┤         ├──────┤      
├────┤      
   │ foo  │  │ tag3 │ ────────▶                   │ 0  │         │ tag3 │      
│ 2  │      
   ├──────┤  ├──────┤                             ├────┤         ├──────┤      
├────┤      
   │ foo  │  │ tag4 │                             │ 0  │         │ tag4 │      
│ 3  │      
   ├──────┤  ├──────┤                             ├────┤         └──────┘      
├────┤      
   │  ..  │  │  ..  │                             │ .. │                       
│ .. │      
   ├──────┤  ├──────┤                             ├────┤                       
├────┤      
   │ baz  │  │ tag4 │                             │ 1  │                       
│ 3  │      
   └──────┘  └──────┘                             └────┘                       
└────┘      
                                                                                
           
          Group Values            values array   keys array    values array   
keys array   
       (distinct values of             (a)           (a)            (b)         
  (b)      
              a,b)                                                              
           
   ```
   
   This could especially impact multi-phase grouping where dictionary encoding 
will save significant time hashing values for low cardinality string columns.
   
   In fact we think we may have seen a performance regression when picking up 
this change downstream as well, which could also be explained by the 
observation abive
   
   Thus I recommend we revert this change via 
https://github.com/apache/arrow-datafusion/pull/8740 while we reconsider how to 
handle this case (maybe just for single column group by? Maybe do the 
dictionary encoding within the `RowEncoder` to avoid generating many redudant 
strings?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to