malinjawi commented on issue #12263: URL: https://github.com/apache/gluten/issues/12263#issuecomment-4673911362
> I remember we have a PR to convert Spark's columnar to Velox directly @FelixYBW that's #7392 (`ArrowColumnarToVeloxColumnarExec`). It's not disabled, it just never triggers here: the transition only fires for plan nodes that declare Gluten's Arrow batch types, and a vanilla `BatchScanExec` over lance-spark declares the vanilla batch type. There's no vanilla->Velox transition registered since Gluten can't assume an arbitrary `ColumnarBatch` is Arrow-backed. Hence C2R in Spark + R2C in Gluten. I went through lance-spark's code to figure out what the integration actually needs from their side. Turns out almost everything is already public, and the "add interface in lance-spark vs copy code like iceberg/delta" question comes down to three small accessors: 1. read: no public way to get the raw Arrow data. `LanceFragmentColumnarBatchScanner` doesn't expose its `VectorSchemaRoot` and `LanceArrowColumnVector` doesn't expose the wrapped `ValueVector`. Either one works for us. 2. write: `LanceBatchWrite.TaskCommit`'s constructor is package-private, we need a public factory to produce commit messages their driver-side `commit()` accepts. 3. write: `SparkWrite.getWriteOptions()` is package-private (`LanceSparkWriteOptions.toWriteParams()` itself is public). With those, the read side is what @zhztheplayer suggested: an `OffloadLanceScan` rule (same shape as `OffloadIcebergScan`) swapping the scan for an exec that declares `ArrowJavaBatchType`, rewrapping via `ArrowWritableColumnVector.loadColumns()` without copying, and the existing transitions handle the rest. Since this wraps the already-planned `LanceScan`, lance's pushdowns (filter, count-star, TopN, zonemap pruning) are preserved. One caveat: the scanner also emits non-Arrow virtual vectors (fragment-id constant, blob position/size), so the rule needs validation fallback for those projections. For the write side, `LanceDataWriter` already ends in `Fragment.create(datasetUri, arrowStream, params)`, which is @JkSelf's pipeline with one adjustment: per-task `Fragment.create` rather than `Dataset.write()`, with the single commit staying on the driver, which is what lance-spark already does. So a Gluten writer (via `ColumnarV2TableWriteExec`, same as iceberg native write) can export Velox batches through the C data interface into that same call and skip the row->Arrow buffering entirely. No Spark API change needed to ship this, though the Arrow I/O API in Spark is still worth pursuing long term. Btw the repro above uses the `com.lancedb` 0.0.15 bundle, which is the last release under those coordinates. Releases moved to `org.lance` after the package rename (lance-format/lance-spark#133), currently 0.5.1, so even the catalog class name in that config differs now. Everything I checked above is against 0.5.1. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
