minni31 opened a new pull request, #12077:
URL: https://github.com/apache/gluten/pull/12077

   ## CONTEXT
   
   `RDDScanExec` is Spark's physical plan node used when creating DataFrames from in-memory RDDs
   via `sparkSession.createDataset(rdd)` or `createDataFrame(rdd, schema)`. Currently, when Gluten
   encounters this node with the Velox backend, it falls back to vanilla Spark's row-based execution.
   The ClickHouse backend already supports this via `CHRDDScanTransformer`.
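
For readers who want to see where this node shows up, here is a minimal sketch using plain Spark APIs only (no Gluten plugin loaded). It builds a DataFrame from an `RDD[Row]` with an explicit schema and collects the `RDDScanExec` nodes from the physical plan. The object name `RddScanPlanDemo`, the helper `rddScanNodes`, and the two-column schema are illustrative assumptions, not part of this PR.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.execution.RDDScanExec
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Illustrative only: names here are hypothetical and not taken from the PR.
object RddScanPlanDemo {
  // Returns the RDDScanExec nodes found in the physical plan of a DataFrame
  // built from an in-memory RDD[Row] with an explicit schema.
  def rddScanNodes(spark: SparkSession): Seq[RDDScanExec] = {
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = true)))
    val rdd = spark.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, null)))
    val df = spark.createDataFrame(rdd, schema)
    // sparkPlan is the physical plan before preparation rules are applied.
    df.queryExecution.sparkPlan.collect { case scan: RDDScanExec => scan }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("rdd-scan-demo")
      .getOrCreate()
    try rddScanNodes(spark).foreach(node => println(node.nodeName))
    finally spark.stop()
  }
}
```

This is the node that the Velox backend currently leaves to vanilla Spark, as described above.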
   
   ## WHAT
   
   This PR adds native Velox execution support for `RDDScanExec` by implementing a
   `VeloxRDDScanTransformer`. The transformer converts `RDD[InternalRow]` into Velox columnar
   batches using the existing `RowToVeloxColumnarExec` JNI infrastructure, so no new native code
   is needed.
   
   Key design decisions:
   
   - **Reuses existing infrastructure**: The transformer delegates to `RowToVeloxColumnarExec`
     for the actual row-to-columnar conversion, keeping the implementation lean and consistent
     with how Velox already handles row-based input.
   - **Schema validation**: `doValidateInternal` rejects complex types (ARRAY, MAP, STRUCT) that
     the native row-to-columnar converter doesn't support, ensuring clean fallback to vanilla Spark
     for unsupported schemas rather than a runtime crash.
   - **Leaf node correctness**: `withNewChildrenInternal` returns `this` since `RDDScanTransformer`
     is a leaf node with no children.
   - **Follows existing patterns**: Mirrors the structure of `CHRDDScanTransformer` in the
     ClickHouse backend.
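
The schema-validation decision above can be illustrated with a small sketch built on Spark's `StructType` API alone. This is not the PR's `doValidateInternal`; `ComplexTypeCheck`, `complexFields`, and `isSupported` are hypothetical names, and the sketch inspects top-level fields only (whether the actual check does more is not stated in this description).

```scala
import org.apache.spark.sql.types._

// Hypothetical helper, not Gluten's API: flags the field types that the
// description says are rejected (ARRAY, MAP, STRUCT).
object ComplexTypeCheck {
  // Names of top-level fields whose data type is ARRAY, MAP, or STRUCT.
  def complexFields(schema: StructType): Seq[String] =
    schema.fields.toSeq.collect {
      case StructField(name, _: ArrayType | _: MapType | _: StructType, _, _) => name
    }

  // A schema passes this sketch's check when it has no complex top-level field.
  def isSupported(schema: StructType): Boolean = complexFields(schema).isEmpty
}
```

Returning the offending field names, rather than a bare boolean, is one way to make a fallback reason reportable; that choice is an assumption of this sketch.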
   
   ## Changes
   
   - **`VeloxRDDScanTransformer.scala`** (new): Columnar execution node wrapping
     `RowToVeloxColumnarExec` for native row-to-columnar conversion.
   - **`VeloxSparkPlanExecApi.scala`** (modified): Overrides `isSupportRDDScanExec` and
     `getRDDScanTransform` to wire up the new transformer.
   - **`VeloxRDDScanSuite.scala`** (new): 7 unit tests covering plan replacement, string and
     numeric types, downstream aggregation, empty RDD, null values, repeated reads, and all
     supported primitive types.
   
   ## Test Results
   
   All **7 unit tests** passed on the internal CI pipeline (build 218528457):
   
   | Test Name | Status |
   |-----------|--------|
   | basic RDDScanExec is replaced by VeloxRDDScanTransformer | ✅ |
   | RDDScan with string and numeric types | ✅ |
   | RDDScan with aggregation downstream | ✅ |
   | RDDScan with empty RDD | ✅ |
   | RDDScan preserves data correctness with multiple re-reads | ✅ |
   | RDDScan with null values | ✅ |
   | RDDScan with all supported primitive types | ✅ |
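
The suite itself is not included in this message. As a rough illustration of what the empty-RDD, null-value, and re-read cases exercise, here is a sketch in plain Spark; `RddScanChecks`, `fromRows`, and `rereadIsStable` are hypothetical names, and nothing here asserts that a Velox transformer is in the plan.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Illustrative helpers only; not the contents of VeloxRDDScanSuite.
object RddScanChecks {
  private val schema = StructType(Seq(
    StructField("id", IntegerType, nullable = true),
    StructField("name", StringType, nullable = true)))

  // Builds a DataFrame backed by an in-memory RDD[Row] (possibly empty).
  def fromRows(spark: SparkSession, rows: Seq[Row]): DataFrame =
    spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

  // Collects the DataFrame twice and reports whether both reads match.
  def rereadIsStable(df: DataFrame): Boolean = {
    val first = df.collect().toSeq
    val second = df.collect().toSeq
    first == second
  }
}
```

A caller would typically build one DataFrame with nulls and one from `Seq.empty[Row]`, then compare counts and repeated `collect()` results against the expected rows.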
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

