sezruby commented on issue #12263: URL: https://github.com/apache/gluten/issues/12263#issuecomment-4683035652
A few additional points worth deciding explicitly before we file the lance-spark issue, on top of the direction @LuciferYang and @FelixYBW already agreed on. Mostly about scope and edges — none of this changes the C-ABI design. **1. `COUNT(*)` carve-out.** Lance has `LanceScanner.countRows()` (a JNI shortcut, no scan, no Arrow), and lance-spark already wires it via `LanceCountStarPartitionReader`. Gluten's offload rule shouldn't match that case — Velox can't beat a native row count. One predicate in the rule, prevents an obvious regression on a very common query. **2. Virtual columns (`_fragid`, `<col>__blob_position`, `<col>__blob_size`).** These are appended Java-side in `LanceFragmentColumnarBatchScanner.loadNextBatch` lines 65-73, after the Arrow import — they're NOT in the C-stream. Three options: - (a) Validator falls back to vanilla when projected. Simple, safe; loses offload on these queries. - (b) Synthesize them in a Velox `Project` node above the offloaded scan (`_fragid` is a per-partition literal int; blob position/size are deterministic transforms over a `StructVector`). Keeps offload, but bakes lance-domain logic into gluten. - (c) Push them into the native scanner via a future lance-core change. Out of scope. I'd ship (a) for v1 and revisit (b) once the integration lands. **3. Row-level ops out of scope.** `MERGE`/`UPDATE`/`DELETE` go through `SparkPositionDeltaWrite`; backfills go through `AddColumnsBackfillWrite` / `UpdateColumnsBackfillWrite`. Gluten doesn't offload row-level ops on any source today (iceberg falls back too). Explicit: rule matches plain `BatchScanExec` (read) + plain `LanceBatchWrite` (insert/CTAS) only. Everything else falls through unchanged. **4. Memory accounting is a non-issue with this design.** `LanceRuntime.allocator()`'s unbounded `RootAllocator` worries me less than I initially thought — gluten allocates the `ArrowArrayStream` from `ArrowBufferAllocators.contextInstance()` (which IS wired into Spark `TaskMemoryManager` via `ArrowReservationListener`), and the C-data release callback pulls released memory back through that allocator. Lance-spark's allocator never holds anything for gluten on the read path. So no need to add a caller-supplied `BufferAllocator` API on lance-spark — the natural ownership boundary already does the right thing. **5. `WriteFragmentBuilder.schema(LanceSchema)` is an optimization, not a requirement.** Verified on `lance v8.0.0-beta.9` (lance-spark's pinned version) — `Fragment.write().data(ArrowArrayStream).execute()` already passes only `stream.memoryAddress()` across JNI; the schema travels with the stream's `get_schema` callback. So `schema(...)` is useful for skipping a `dataset.open()` round-trip per task, but the gluten writer is correct without it. Adopt as a perf follow-up. **6. Phasing.** Worth being explicit: - **Phase 1 (JVM-only):** All three lance-spark API changes + `gluten-lance` module with `OffloadLanceScan`, `LanceScanTransformer`, `LanceWriteTransformer`. No cpp changes. Reuses `ArrowAbiUtil.importToSparkColumnarBatch` + the existing `ArrowColumnarToVeloxColumnarExec` transition. Ships the integration. - **Phase 2 (cpp-side):** Direct C-stream import into Velox via `Bridge.cpp`, skipping the Arrow→Velox transition. Pure perf, no behavior change, ships independently after Phase 1 lands. - **Phase 3 (when upstream lands):** If velox#16556 lands cleanly, expose it as an alternative path. lance-spark stays the default. If lance-core adds native virtual-column support, switch (a)→(c) above. **7. What survives lance-spark planning.** For the record, since it's worth being clear: the DSv2 `Scan` builder runs before gluten rules, so filter pushdown, column pruning, kNN prefilter, TopN, zonemap pruning, fragment skipping, and namespace integration all stay intact. Gluten only intercepts the per-batch handoff, not the planning. Aggregations / joins / windows offload to Velox above the scan as usual. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
