minni31 opened a new pull request, #12077:
URL: https://github.com/apache/gluten/pull/12077
## CONTEXT
`RDDScanExec` is Spark's physical plan node used when creating DataFrames from in-memory RDDs
via `sparkSession.createDataset(rdd)` or `createDataFrame(rdd, schema)`.
Currently, when Gluten encounters this node with the Velox backend, it falls back to vanilla
Spark's row-based execution.
The ClickHouse backend already supports this via `CHRDDScanTransformer`.
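
For reference, the plan shape in question can be reproduced in vanilla Spark with a DataFrame built from an `RDD[Row]` and an explicit schema. The sketch below assumes a local Spark session with no Gluten plugin loaded; the object and method names (`RddScanPlanExample`, `buildDataFrame`, `containsRddScan`) are made up for this illustration and are not part of the PR.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.execution.RDDScanExec
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical helper object, for illustration only.
object RddScanPlanExample {

  // Build a DataFrame from an in-memory RDD[Row] plus an explicit schema.
  def buildDataFrame(spark: SparkSession): DataFrame = {
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = true)))
    val rdd = spark.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, null)))
    spark.createDataFrame(rdd, schema)
  }

  // True if the executed physical plan contains an RDDScanExec node.
  def containsRddScan(df: DataFrame): Boolean =
    df.queryExecution.executedPlan.collectFirst { case s: RDDScanExec => s }.isDefined

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("rdd-scan-plan-example")
      .getOrCreate()
    try {
      val df = buildDataFrame(spark)
      println(s"plan contains RDDScanExec: ${containsRddScan(df)}")
      df.explain() // prints the physical plan for inspection
    } finally {
      spark.stop()
    }
  }
}
```

With Gluten and the Velox backend enabled, this is the node that the change described below is meant to replace.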
## WHAT
This PR adds native Velox execution support for `RDDScanExec` by implementing a
`VeloxRDDScanTransformer`. The transformer converts `RDD[InternalRow]` into Velox columnar
batches using the existing `RowToVeloxColumnarExec` JNI infrastructure, so no new native code
is needed.
Key design decisions:
- **Reuses existing infrastructure**: The transformer delegates to `RowToVeloxColumnarExec`
  for the actual row-to-columnar conversion, keeping the implementation lean and consistent
  with how Velox already handles row-based input.
- **Schema validation**: `doValidateInternal` rejects complex types (ARRAY, MAP, STRUCT) that
  the native row-to-columnar converter doesn't support, ensuring clean fallback to vanilla Spark
  for unsupported schemas rather than a runtime crash.
- **Leaf node correctness**: `withNewChildrenInternal` returns `this` since `RDDScanTransformer`
  is a leaf node with no children.
- **Follows existing patterns**: Mirrors the structure of `CHRDDScanTransformer` in the
  ClickHouse backend.
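
The schema-validation idea can be sketched roughly as follows. This is not the code in `doValidateInternal`: it is a standalone approximation that uses only Spark's public type classes, checks top-level fields only, and returns a plain `Either` instead of Gluten's validation result type. The names `RddScanSchemaCheck`, `firstComplexField`, and `validate` are hypothetical.

```scala
import org.apache.spark.sql.types.{ArrayType, IntegerType, MapType, StringType, StructField, StructType}

// Hypothetical stand-in for the validation step, for illustration only.
object RddScanSchemaCheck {

  // Find the first top-level field whose type is ARRAY, MAP, or STRUCT.
  def firstComplexField(schema: StructType): Option[StructField] =
    schema.fields.find { field =>
      field.dataType match {
        case _: ArrayType | _: MapType | _: StructType => true
        case _ => false
      }
    }

  // Left(reason) means "fall back to vanilla Spark"; Right(()) means "offload".
  def validate(schema: StructType): Either[String, Unit] =
    firstComplexField(schema) match {
      case Some(field) =>
        Left(s"Unsupported type ${field.dataType.catalogString} in column '${field.name}'")
      case None =>
        Right(())
    }

  def main(args: Array[String]): Unit = {
    val flat = StructType(Seq(
      StructField("id", IntegerType),
      StructField("name", StringType)))
    val nested = StructType(Seq(
      StructField("id", IntegerType),
      StructField("tags", ArrayType(StringType))))
    println(validate(flat))
    println(validate(nested))
  }
}
```

Rejecting the schema at validation time is what lets the planner keep the original `RDDScanExec` for such inputs instead of failing during native conversion.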
## Changes
- **`VeloxRDDScanTransformer.scala`** (new): Columnar execution node wrapping
  `RowToVeloxColumnarExec` for native row-to-columnar conversion.
- **`VeloxSparkPlanExecApi.scala`** (modified): Overrides `isSupportRDDScanExec` and
  `getRDDScanTransform` to wire up the new transformer.
- **`VeloxRDDScanSuite.scala`** (new): 7 unit tests covering plan replacement, string and
  numeric types, downstream aggregation, empty RDDs, null values, repeated reads of the same
  DataFrame, and all supported primitive types.
## Test Results
All **7 unit tests** passed on the internal CI pipeline (build 218528457):
| Test Name | Status |
|-----------|--------|
| basic RDDScanExec is replaced by VeloxRDDScanTransformer | ✅ |
| RDDScan with string and numeric types | ✅ |
| RDDScan with aggregation downstream | ✅ |
| RDDScan with empty RDD | ✅ |
| RDDScan preserves data correctness with multiple re-reads | ✅ |
| RDDScan with null values | ✅ |
| RDDScan with all supported primitive types | ✅ |