ctsk commented on issue #6692:
URL: https://github.com/apache/arrow-rs/issues/6692#issuecomment-2785965705

   I've been pondering the problem of an "appending" take kernel and would like 
to share my thoughts.
   
   ### Coming from DataFusion...
   
   Based on the experimental implementation of `take_in`, I believe the 
interface needed for DataFusion's use case looks like this:
   
   ```rust
   trait ColumnSink {
       fn take(&mut self, source: &dyn Array, indices: &dyn Array);
       fn emit_finished(&mut self) -> Option<Arc<dyn Array>>;
       fn flush(&mut self);
   }
   ```
   
   ColumnSink maintains internal state, with `ColumnSink::take` logically 
appending the result of the `take` operation to its internal state. A key 
property is that ColumnSink can output its state incrementally rather than 
requiring a single vector output. This makes ColumnSink at least as capable as 
DataFusion's current approach, as it can fall back to using the current `take` 
kernel and leave concatenation to a `CoalesceBatchesExec`.
   
   I believe this abstraction belongs to DataFusion rather than arrow-rs.
   
   ### Allocation-minimal kernels in arrow-rs
   
   Ideally, ColumnSink would produce perfectly-sized batches of a given size, 
requiring it to control the destination area's sizing. This calls for a version 
of take that writes into a provided buffer. I propose:
   
   ```rust
   fn take_and_append(
      source: &dyn Array,
      indices: &dyn Array,
      destination: ???
   ) -> Result<(), ArrowError>;
   ```
   
   The question is what type should replace `???`. I've considered these 
options:
   
   #### Option 1: `Box<dyn ArrayBuilder>` | `&mut dyn ArrayBuilder`
   This was the starting point for this issue. ArrayBuilder is convenient 
because it already contains the appropriate data structures. While 
`PrimitiveArrays` hold all state in the result array, for `DictionaryArrays`, 
we'd want the destination to hold a hash table that won't be in the finished 
array.
   
   The challenge with using ArrayBuilder is that builders don't sufficiently 
expose their internals for kernel implementations.
   
   Potential approaches:
   1. **Expand the builder interface**: Add methods giving callers more control 
over builder mutation. The challenge is finding reusable abstractions between 
builders, as well as balancing between `unsafe` methods that might leave the 
builder in an inconsistent state and enforcing consistency after each 
manipulation.
   
   2. **Deconstruction approach**: Each builder would implement `into_parts` 
and `new`/`new_unchecked`, similar to what exists for Arrays. Kernels could 
arbitrarily mutate these buffers and then reconstruct the builder. This seems 
cleaner than expanding the builder interface further.
   
   The deconstruction approach has the drawback of requiring ownership of the 
builder, whereas a mutable reference would make more sense semantically.
   
   #### Option 2: `&mut MutableArray`
   Taking the deconstruction approach further, we could introduce mutable 
counterparts (`MutablePrimitiveArray<T>`, etc.) for each array type as the 
output of `into_parts`.
   
   We would add a type-erased MutableArray abstraction on top. This type would 
essentially hold the same state as a builder but could be more freely mutated 
without many consistency guarantees. Consistency would be checked when 
converting from MutableArray to Array.
   
   Each concrete mutable array would expose methods to manipulate each of its 
fields, making its internal data and workings transparent.
   
   The trait could be:
   
   ```rust
   trait MutableArray {
      fn data_type(&self) -> DataType;
      fn len(&self) -> usize;
      fn capacity(&self) -> usize;
      fn into_array(self) -> Arc<dyn Array>;
      fn into_builder(self) -> Arc<dyn ArrayBuilder>;
   }
   ```
   
   It wouldn't need common methods to manipulate the MutableArray since kernels 
would downcast it before manipulation. It *could* feature a method to modify 
the optional null buffer.
   
   *All names are subject to change*
   
   Thoughts?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to