[PR] Add optional native Lance scan support [datafusion-comet]

via GitHub Fri, 12 Jun 2026 01:42:02 -0700


wirybeaver opened a new pull request, #4633:
URL: https://github.com/apache/datafusion-comet/pull/4633


   ## Which issue does this PR close?
   
   Closes #4632.
   
   ## Rationale for this change
   
   Comet already has a native table-scan path for Iceberg. Lance tables are 
currently planned and read through Lance Spark. This prototype keeps Lance 
Spark as the Spark planning contract, then lets an optional Comet contrib 
reader detect Lance V2 scans, extract a stable descriptor from Lance Spark, and 
execute the assigned Lance fragments through native Rust Lance APIs.
   
   The Lance Spark side of the descriptor contract is proposed in 
lance-format/lance-spark#624.
   
   ## What changes are included in this PR?
   
   - Adds an opt-in `contrib-lance` Maven profile and Rust `contrib-lance` 
feature.
   - Adds a small reflection-only Lance bridge in Comet core so default builds 
do not depend on Lance Spark.
   - Adds `spark.comet.scan.lanceNative.enabled`, disabled by default.
   - Extends scan planning to detect Lance `BatchScanExec` plans and delegate 
to contrib-lance when present and enabled.
   - Adds typed native proto support with `lance_scan = 118` and split-mode 
payloads.
   - Adds Scala contrib serialization/execution classes for Lance native scans.
   - Adds Rust native `LanceScanExec` using the Rust Lance API for dataset 
open, fragment selection, projection, filter SQL, limit/offset, batch size, and 
record batch streaming.
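
As a rough illustration of the reflection-only bridge and the opt-in flag listed above, the gating decision could look like the sketch below. It is not the code in this PR: `LanceBridgeSketch`, `loadOptional`, `shouldUseNativeScan`, and the class name in `LANCE_SCAN_CLASS` are hypothetical names chosen for the example. Only the config key `spark.comet.scan.lanceNative.enabled` and its disabled-by-default behaviour come from the description.

```java
import java.util.Map;
import java.util.Optional;

// Sketch only: illustrates reflection-based detection of an optional dependency.
final class LanceBridgeSketch {
    // Hypothetical placeholder, NOT the real Lance Spark scan class name.
    static final String LANCE_SCAN_CLASS = "com.example.lance.spark.HypotheticalLanceScan";
    // Config key from the PR description; treated as false when unset.
    static final String ENABLED_KEY = "spark.comet.scan.lanceNative.enabled";

    /** Resolves a class by name so callers need no compile-time dependency on it. */
    static Optional<Class<?>> loadOptional(String className) {
        try {
            // initialize=false: resolve the class without running its static initializers.
            ClassLoader loader = LanceBridgeSketch.class.getClassLoader();
            return Optional.of(Class.forName(className, false, loader));
        } catch (ClassNotFoundException | LinkageError e) {
            // The optional jar is absent or incompatible: report "not available".
            return Optional.empty();
        }
    }

    /** True only if the flag is on, the class is loadable, and the scan is an instance of it. */
    static boolean shouldUseNativeScan(Map<String, String> conf, Object scan, String scanClassName) {
        boolean enabled = Boolean.parseBoolean(conf.getOrDefault(ENABLED_KEY, "false"));
        if (!enabled || scan == null) {
            return false;
        }
        return loadOptional(scanClassName).map(c -> c.isInstance(scan)).orElse(false);
    }
}
```

With this shape, a default build that lacks the optional jar falls through to `Optional.empty()` and stays on the existing scan path, which matches the intent that default builds do not depend on Lance Spark.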
   
   This is intentionally a draft prototype. The minimal v1 scope covers 
ordinary Lance table reads only. Index/search reads, namespace-backed 
credential refresh, metadata/version columns, aggregation pushdown, and 
production CI coverage are left for future phases.
   
   Known blocker before this can be merge-ready: packaged Comet currently 
contains `org.apache.arrow.c` classes rewritten against Comet's shaded Arrow 
allocator, while Lance Spark expects the normal Arrow C Data ABI. A smoke test 
against packaged Spark with both jars on the classpath exposes this conflict. 
We need an explicit Arrow C Data packaging/classloader strategy for Comet + 
Lance Spark before merging a production-ready native Lance reader.
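
One generic way to confirm which jar wins for a conflicting class is to ask the JVM where it loaded the class from. The helper below is a diagnostic sketch only, not part of this PR; `ClassOriginSketch` and `originOf` are hypothetical names, and the placeholder strings it returns are arbitrary.

```java
import java.security.CodeSource;

// Sketch only: reports the jar or directory a class was loaded from.
final class ClassOriginSketch {
    /** Returns the code-source location of a class, or a placeholder if unavailable. */
    static String originOf(String className) {
        try {
            ClassLoader loader = ClassOriginSketch.class.getClassLoader();
            // initialize=false: we only need the Class object, not its static state.
            Class<?> cls = Class.forName(className, false, loader);
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            // JDK classes loaded by the bootstrap loader have no code source.
            if (src == null || src.getLocation() == null) {
                return "<bootstrap or unknown>";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "<not on classpath>";
        }
    }
}
```

Calling `ClassOriginSketch.originOf("org.apache.arrow.c.Data")` inside the failing Spark session would show whether the Arrow C Data class resolves to the Comet jar or to the Lance Spark dependency.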
   
   ## How are these changes tested?
   
   Passed:
   
   - `~/.cargo/bin/cargo check -p datafusion-comet --no-default-features`
   - `~/.cargo/bin/cargo check -p datafusion-comet --no-default-features 
--features contrib-lance`
   - `./mvnw test -Dtest=none 
-Dsuites="org.apache.comet.rules.CometScanRuleSuite" -Pspark-4.1,contrib-lance 
-Dscalastyle.skip=true`
   - `./mvnw package -DskipTests -Pspark-4.1,contrib-lance 
-Dscalastyle.skip=true`
   
   Smoke test attempted:
   
   - `source ~/uvenv/common/bin/activate && python 
/home/user/draft/comet_lance_native_smoke.py`
   
   The smoke test writes and reads a local Lance dataset, but packaged Comet + 
Lance Spark currently fails at runtime with the Arrow C Data ABI/classpath 
conflict described above. The draft PR keeps that blocker visible for design 
review instead of hiding it behind unit-only coverage.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

