malinjawi commented on issue #12263:
URL: https://github.com/apache/gluten/issues/12263#issuecomment-4673911362

   > I remember we have a PR to convert Spark's columnar to Velox directly
   
   @FelixYBW that's #7392 (`ArrowColumnarToVeloxColumnarExec`). It's not 
disabled, it just never triggers here: the transition only fires for plan nodes 
that declare Gluten's Arrow batch types, and a vanilla `BatchScanExec` over 
lance-spark declares the vanilla batch type. There's no vanilla->Velox 
transition registered since Gluten can't assume an arbitrary `ColumnarBatch` is 
Arrow-backed. Hence C2R in Spark + R2C in Gluten.
   
   I went through lance-spark's code to figure out what the integration 
actually needs from their side. Turns out almost everything is already public, 
and the "add interface in lance-spark vs copy code like iceberg/delta" question 
comes down to three small accessors:
   
   1. read: no public way to get the raw Arrow data. 
`LanceFragmentColumnarBatchScanner` doesn't expose its `VectorSchemaRoot` and 
`LanceArrowColumnVector` doesn't expose the wrapped `ValueVector`. Either one 
works for us.
   2. write: `LanceBatchWrite.TaskCommit`'s constructor is package-private, we 
need a public factory to produce commit messages their driver-side `commit()` 
accepts.
   3. write: `SparkWrite.getWriteOptions()` is package-private 
(`LanceSparkWriteOptions.toWriteParams()` itself is public).
   
   With those, the read side is what @zhztheplayer suggested: an 
`OffloadLanceScan` rule (same shape as `OffloadIcebergScan`) swapping the scan 
for an exec that declares `ArrowJavaBatchType`, rewrapping via 
`ArrowWritableColumnVector.loadColumns()` without copying, and the existing 
transitions handle the rest. Since this wraps the already-planned `LanceScan`, 
lance's pushdowns (filter, count-star, TopN, zonemap pruning) are preserved. 
One caveat: the scanner also emits non-Arrow virtual vectors (fragment-id 
constant, blob position/size), so the rule needs validation fallback for those 
projections.
   
   For the write side, `LanceDataWriter` already ends in 
`Fragment.create(datasetUri, arrowStream, params)`, which is @JkSelf's pipeline 
with one adjustment: per-task `Fragment.create` rather than `Dataset.write()`, 
with the single commit staying on the driver, which is what lance-spark already 
does. So a Gluten writer (via `ColumnarV2TableWriteExec`, same as iceberg 
native write) can export Velox batches through the C data interface into that 
same call and skip the row->Arrow buffering entirely. No Spark API change 
needed to ship this, though the Arrow I/O API in Spark is still worth pursuing 
long term.
   
   Btw the repro above uses the `com.lancedb` 0.0.15 bundle, which is the last 
release under those coordinates. Releases moved to `org.lance` after the 
package rename (lance-format/lance-spark#133), currently 0.5.1, so even the 
catalog class name in that config differs now. Everything I checked above is 
against 0.5.1.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to