rluvaton opened a new issue, #18308:
URL: https://github.com/apache/datafusion/issues/18308

   Currently, scalar information is lost when data moves between execution 
plans. While `PhysicalExpr` and `ScalarUDFImpl` work with `ColumnarValue` 
(enabling optimized implementations for scalars), the stream of `RecordBatch`es 
between `ExecutionPlan`s doesn't preserve scalar knowledge.
   
   ### Proposed Solution
   
   Allow `ExecutionPlan` to return a stream that preserves scalar information 
(will return a stream of `ReturnedValue`):
   
   ```rust
   enum ReturnedValue {
       Batch(RecordBatch),
       BatchWithScalars(RecordBatchWithScalars)
   }
   
   /// Same as ColumnarValue but with Arc-wrapped Scalar to avoid unnecessary 
copying
   enum ColumnarValue {
      Array(ArrayRef),
      Scalar(Arc<ScalarValue>),
   }
   
   struct RecordBatchWithScalars {
       schema: SchemaRef,
       columns: Vec<ColumnarValue>,
       row_count: usize,
   }
   ```
   
   ### Benefits
   
   #### Sort Operations
   - **Skip sorting**: When a column contains a scalar value (same value for 
the entire batch), sorting on that column can be skipped
   - **Efficient copying**: When sorting by other columns, scalar columns can 
be copied more efficiently or only partially
   - **Produce scalars**: Sort operations can output scalars for faster 
downstream operations
   
   #### Aggregation
   - **Avoid expensive operations**: When grouping by a scalar column, 
expensive hashing/comparison can be avoided
   
   ### Downsides
   1. Significant breaking change
   2. Increased code complexity
   
   ### Potential Extensions
   
   We could add another variant for row-based encoding:
   ```rust
   ReturnedValue::EncodedRows(Rows)
   ```
   
   This would allow operators that process data as rows to pass encoded rows 
directly to the next operator, avoiding unnecessary conversions between 
columnar and row formats. The receiving operator can then decide whether to:
   - Use row-based input directly (avoiding conversion overhead if it would 
benefit from row based as well)
   - Convert to columnar format (same cost as current behavior)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to