LuciferYang commented on issue #12263:
URL: https://github.com/apache/gluten/issues/12263#issuecomment-4677377382

   @jackylee-ch @FelixYBW Sorry for the late reply. I went through lance-spar, 
the lance Java SDK and gluten upmaster before answering, so this took a bit.
   
   From the lance-spark side: my personal read is that what Gluten needs here 
is small and shouldn't add much maintenance burden. On the technical substance, 
@malinjawi's audit matches what I see in the code, but I'd cut the list down to 
two changes, and adjust the first one:
   
   **1. Read: expose the Arrow C stream instead of `VectorSchemaRoot`.** Inside 
the lance SDK, `LanceScanner.scanBatches()` already gets the scan result as a 
native `FFI_ArrowArrayStream` and only then imports it into Java Arrow. If 
lance-spark exposes the stream at that level, Gluten can import it with its own 
Arrow copy, or directly into Velox, and the two projects never have to share 
Java Arrow objects. The reason this matters: Gluten on Spark 3.5 is on Arrow 
15.0.0 (vanilla, after #12244) and keeps 
`org.apache.arrow.{memory,vector,c,dataset}` unrelocated in its bundle jar; 
lance-spark's bundle also ships unrelocated `arrow-c-data`/`arrow-dataset`, 
while lance-core is built against Arrow 18.3. A `getVectorSchemaRoot()`-style 
accessor plus `ArrowWritableColumnVector.loadColumns(...)` only works if both 
sides load the same `org.apache.arrow.vector` classes, which today depends on 
classloader order. Passing C-ABI pointers sidesteps the whole problem.
   
   **2. Write: a public factory for `TaskCommit`.** It just wraps 
`List<FragmentMetadata>`, so this is a one-liner. 
`SparkWrite.getWriteOptions()` doesn't need to change — 
`LanceSparkWriteOptions.toWriteParams()` and all the individual getters are 
already public.
   
   The write path is actually in better shape than the thread assumed. 
`Fragment.write().data(ArrowArrayStream)` hands the raw stream pointer straight 
to JNI without materializing Java Arrow vectors, and the driver-side commit 
(`Transaction` + `Append` + `CommitBuilder`) is all public — it's exactly what 
`LanceBatchWrite.commit()` does internally. Lance also recently added 
`WriteFragmentBuilder.schema(...)` (lance-format/lance#6919, already in the 
8.0.0-beta.9 that lance-spark depends on) so executor-side fragment writes no 
longer need to open the dataset first. To be clear, lance-spark's own writer 
doesn't use these yet (it still goes through the deprecated `Fragment.create` 
without the schema override) — the point is that Gluten's columnar writer can 
call the SDK directly and only reuse lance-spark's driver-side commit. So 
@JkSelf's pipeline works today, no Spark API change required.
   
   One smaller thing: lance-spark uses a single process-wide unbounded 
`RootAllocator` (`LanceRuntime.allocator()`), not tied into Spark's task 
memory. Per-batch usage is bounded by `maxBatchBytes` so it's acceptable for a 
connector, but if Gluten wants the read path accounted we could additionally 
accept a caller-supplied `BufferAllocator`. Optional, can come later.
   
   If we converge on the C-ABI direction here, we can open an issue on the 
lance-spark side with the concrete API proposal and gather feedback from the 
other maintainers, and we can push this forward together from both ends. I'd 
suggest settling the Arrow question first since it decides the shape of the 
read API.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to