Re: [PR] chore: Improve Arrow FFI documentation [datafusion-comet]

via GitHub Mon, 18 Aug 2025 13:53:12 -0700


parthchandra commented on code in PR #2163:
URL: https://github.com/apache/datafusion-comet/pull/2163#discussion_r2283402464



##########
native/core/src/execution/jni_api.rs:
##########
@@ -361,11 +361,11 @@ fn prepare_output(
 
                 new_array
                     .to_data()
-                    .move_to_spark(array_addrs[i], schema_addrs[i])?;
+                    .move_to_ffi(array_addrs[i], schema_addrs[i])?;

Review Comment:
   `ffi` is really just the mechanism, isn't it? It doesn't itself have 
ownership at any point in time. This method does in fact move the ownership to 
Spark. Or am I mistaken?



##########
docs/source/contributor-guide/plugin_overview.md:
##########
@@ -140,8 +140,31 @@ accessing Arrow data structures from multiple languages.
 
 [Arrow C Data Interface]: 
https://arrow.apache.org/docs/format/CDataInterface.html
 
-- `CometExecIterator` invokes native plans and uses Arrow FFI to read the 
output batches
-- Native `ScanExec` operators call `CometBatchIterator` via JNI to fetch input 
batches from the JVM
+### Array Ownership and Lifecycle
+
+#### Importing Batches from Native to JVM
+
+`CometExecIterator` invokes native plans by calling the JNI function 
`executePlan`. The ownership of the output 
+batches, which are created in native code, is transferred to FFI ready to be 
consumed by Java once the `executePlan` 
+function returns.
+
+Once the JVM code finishes processing the arrays, it will call the `release` 
callback, which invokes native code 
+to release the memory that was allocated on the native side.
+
+#### Exporting Batches from JVM to Native
+
+The leaf nodes of native plans are often `ScanExec` operators, which call 
`CometBatchIterator` via JNI to fetch 
+input batches from the JVM.
+
+Note that this approach does not follow best practice and does not benefit 
from zero-copy transfer.
+
+The incoming array data is owned by the JVM and can be freed or reused on the 
JVM side during the next call to

Review Comment:
   Is this correct? `native_comet` uses JVM to read data from the data file(s). 
This data is passed to the native side as a java byte array when the native 
code calls  `org.apache.comet.parquet.Native.setPageV1`. From what I understand 
the native decoder decodes this byte array and constructs arrow vectors.  This 
is where the memory ownership gets fuzzy because, from what I now understand, 
the vectors created by the decoder are backed by mutable buffers _which can be 
reused_. At this point the memory ownership model becomes unsafe. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] chore: Improve Arrow FFI documentation [datafusion-comet]

Reply via email to