alamb commented on issue #6906:
URL: https://github.com/apache/datafusion/issues/6906#issuecomment-2355629355

   # Potential Design
   
   One high level idea is to build a data structure that uses the same internal 
format (views/buffers) as 
[`StringViewArray`](https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html)
 in Arrow:
   
   ```
                     ┌───────────────────────────────────────────┐  
                     │                        Stored in Vec<u8>  │  
                     │   Stored in a         ┌─────────────────┐ │  
                     │     Vec<u128>         │ ┌───────────┐   │ │  
                     │ ┌─────────────┐    ─ ─│▶│some value │   │ │  
      ┌─────┐        │ │ ┌─────────┐ │   │   │ └───────────┘   │ │  
      │  0  │────────┼─┼▶│  View   │ │       │                 │ │  
      ├─────┤        │ │ ├─────────┤ │   │   │                 │ │  
      │  1  │────────┼─┼▶│  View   │─│─ ─ ┐  │      ...        │ │  
      └─────┘        │ │ └─────────┘ │   │   │                 │ │  
        ...          │ │    ...      │    │  │                 │ │  
      ┌─────┐        │ │ ┌─────────┐ │   │   │                 │ │  
      │ N-2 │        │ │ │  View   │─│─ ─ │  │   ┌────────────┐│ │  
      ├─────┤        │ │ ├─────────┤ │     ─ ┼ ─▶│other value ││ │  
      │ N-1 │────────┼─┼▶│  View   │ │       │   └────────────┘│ │  
      └─────┘        │ │ └─────────┘ │       └─────────────────┘ │  
                     │ └─────────────┘  String values are stored │  
                     │                  inline or in extra byte  │  
    Logical group    │                  buffer                   │  
       number        └───────────────────────────────────────────┘  
                      New structure: MutableStringViewBuilder       
                                                                    
                      Current Min/Max value for that group stored in
                      same format as StringViewArray                
                                                                    
   ```
   
   In this design, the current value for  each group is stored in two parts (as 
described on [arrow 
docs](https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html))
   1. a fixed size `u128` view 
   2. a variable length part with the string data
   
   As new batches are updated, each `View` is updated if necessary
   
   Benefits of design:
   1. Hopefully use the same code as in arrow-rs
   4. Allows Zero copy conversion to StringView / BinaryView at output
   5. Use inlined values for quick min/max comparison
   
   I believe (though we will have to verify it) that the conversion from 
MutableStringViewBuilder to just StringArray (not StringViewArray) should also 
be better than the current `GroupsAccumulatorAdapter`. Both conversions need to 
copy the string bytes again into the packed `StringArray` format, but the 
`GroupsAccumulatorAdapter` also has to allocate/free owned `String`s as well
   
   ## Potential challenges
   
   I think the trickiest part of this code, other the low level code 
optimizations is that as min/max values are replaced, data in the variable 
length buffer will become "garbage" (not reachable) thus consuming more memory 
than necessary:
   
   ```
                           Stored in Vec<u8>   
                        ┌────────────────────┐ 
                        │ ┌────────────────┐ │ 
   ┌─────────────┐      │ │prev max value 1│ │ 
   │ ┌─────────┐ │      │ └────────────────┘ │ 
   │ │  View   │─│─ ┐   │        ...         │ 
   │ └─────────┘ │      │ ┌────────────────┐ │ 
   │     ...     │  │   │ │prev max value m│ │ 
   │             │      │ └────────────────┘ │ 
   │             │  │   │ ┌────────────────┐ │ 
   │             │   ─ ─│▶│prev max value m│ │ 
   │             │      │ └────────────────┘ │ 
   │             │      │        ...         │ 
   │             │      └────────────────────┘ 
   │             │                             
   └─────────────┘      Previous min/max values
                        are not pointed to     
                        anymore and need to be 
                        cleaned up             
   ```
   
   I think this means the code will need something 
[`GenericByteViewArray::gc`](https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html#method.gc)
 run occasionally
   
   ## Random Thoughts
   
   Thoughts: maybe this structure (MutableStringViewBuilder??) could be 
upstreamed eventually


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to