alamb opened a new issue, #4466:
URL: https://github.com/apache/arrow-rs/issues/4466

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   I am implementing GroupByHash in DataFusion:
https://github.com/apache/arrow-datafusion/issues/4973
   
   We use the `RowFormat` to store grouping keys, which is awesome.
   
   The grouping operation calculates the `Row` format for each input row,
determines whether that row has been seen previously, and if not, stores the
newly seen `Row`. The only way I know of to do this today is to copy each new
row individually using
[`owned()`](https://docs.rs/arrow-row/42.0.0/arrow_row/struct.Row.html#method.owned):
   
   ```
   ┌──────────────────────────────────┐
   │ ┌───────────────────────────────┐│
   │ │               A               ││
   │ ├───────────────────────────────┤│
   │ │               B               │├────────────┐
   │ ├───────────────────────────────┤│            │
   │ │               A               ││            │
   │ ├───────────────────────────────┤│            │
   │ │               A               ││            │           ┌──────────────────────────────────┐
   │ ├───────────────────────────────┤│            │           │ ┌───────────────────────────────┐│
   │ │               C               ││            │           │ │               A               ││
   │ ├───────────────────────────────┤│            │           │ └───────────────────────────────┘│
   │ │               B               ││            │           │ ┌───────────────────────────────┐│
   │ ├───────────────────────────────┤│            └───────────┼▶│               B               ││
   │ │               A               ││                        │ └───────────────────────────────┘│
   │ ├───────────────────────────────┤│  to add a new row, I   │                                  │
   │ │               A               ││  currently do          │                                  │
   │ └───────────────────────────────┘│  `Row::owned()` to     │                                  │
   │  group keys for input batch      │  get a copy            │   distinct group keys seen in    │
   │  often many repeated values      │                        │   previous batches               │
   │                                  │                        │                                  │
   └──────────────────────────────────┘                        └──────────────────────────────────┘

        arrow_row::Rows                                         Vec<arrow_row::OwnedRow>
   ```
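
To make the cost in the diagram concrete, here is a minimal sketch that models the current approach using only the standard library. It is not the arrow-rs API: plain byte slices stand in for `Row`, `Vec<u8>` stands in for `OwnedRow`, and the function name `distinct_keys_owned` is hypothetical. For brevity the "seen" set borrows from a single input batch; real code would have to look keys up in the stored rows across batches.

```rust
use std::collections::HashSet;

/// Hypothetical model of the current approach: each newly seen key is
/// copied into its own heap allocation (the analogue of `Row::owned()`).
fn distinct_keys_owned<'a>(batch: &[&'a [u8]]) -> Vec<Vec<u8>> {
    // Tracks which keys have already been seen in this batch.
    let mut seen: HashSet<&'a [u8]> = HashSet::new();
    // One separately allocated Vec<u8> per distinct key.
    let mut distinct: Vec<Vec<u8>> = Vec::new();

    for &key in batch {
        // `insert` returns true only the first time a key is seen.
        if seen.insert(key) {
            // A fresh allocation for every new group key.
            distinct.push(key.to_vec());
        }
    }
    distinct
}
```

With the input batch from the diagram (A, B, A, A, C, B, A, A), this produces three distinct keys and therefore three separate allocations, one per group.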
   
   **Describe the solution you'd like**
   I would like to be able to append a `Row` directly to a `Rows`:
   
   ```
   ┌──────────────────────────────────┐
   │ ┌───────────────────────────────┐│
   │ │               A               ││
   │ ├───────────────────────────────┤│
   │ │               B               │├────────────┐
   │ ├───────────────────────────────┤│            │
   │ │               A               ││            │
   │ ├───────────────────────────────┤│            │
   │ │               A               ││            │           ┌──────────────────────────────────┐
   │ ├───────────────────────────────┤│            │           │ ┌───────────────────────────────┐│
   │ │               C               ││            │           │ │               A               ││
   │ ├───────────────────────────────┤│            │           │ ├───────────────────────────────┤│
   │ │               B               ││            └───────────┼▶│               B               ││
   │ ├───────────────────────────────┤│                        │ └───────────────────────────────┘│
   │ │               A               ││                        │                                  │
   │ ├───────────────────────────────┤│  Copying a new Row     │                                  │
   │ │               A               ││  would just copy       │                                  │
   │ └───────────────────────────────┘│  some bytes to the     │                                  │
   │  group keys for input batch      │  other Rows            │   distinct group keys seen in    │
   │  often many repeated values      │                        │   previous batches               │
   │                                  │                        │                                  │
   └──────────────────────────────────┘                        └──────────────────────────────────┘

      arrow_row::Rows                                            arrow_row::Rows
   ```
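
The idea of "just copying some bytes" can be sketched as follows. This is a standard-library model under stated assumptions, not the arrow-rs implementation: the type `RowStore` and its methods `push`, `row`, and `num_rows` are hypothetical names chosen for illustration. All row bytes live in one contiguous buffer, with an offsets array marking row boundaries, so appending a row extends the shared buffer instead of creating a new allocation per row.

```rust
/// Hypothetical model of a `Rows`-like container: every row's bytes are
/// stored back to back in `buffer`, and row `i` occupies
/// `buffer[offsets[i]..offsets[i + 1]]`.
struct RowStore {
    buffer: Vec<u8>,
    offsets: Vec<usize>,
}

impl RowStore {
    /// An empty store has no bytes and a single starting offset of 0.
    fn new() -> Self {
        Self { buffer: Vec::new(), offsets: vec![0] }
    }

    /// Appends a row by copying its bytes to the end of the shared buffer.
    /// The buffer grows with amortized reallocation, so there is no
    /// separate heap allocation for each row.
    fn push(&mut self, row: &[u8]) {
        self.buffer.extend_from_slice(row);
        self.offsets.push(self.buffer.len());
    }

    /// Returns a borrowed view of row `i` without copying.
    fn row(&self, i: usize) -> &[u8] {
        &self.buffer[self.offsets[i]..self.offsets[i + 1]]
    }

    /// Number of rows stored so far.
    fn num_rows(&self) -> usize {
        self.offsets.len() - 1
    }
}
```

In this model, storing the distinct keys A and B from the diagram costs two `push` calls into one buffer, and lookups return slices into that buffer rather than owned copies.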
   
   **Describe alternatives you've considered**
   
   Currently my POC code uses `Vec<OwnedRow>`, which adds an extra allocation
for each row 😢
   
   **Additional context**
   https://github.com/apache/arrow-datafusion/issues/4973

