Re: [PR] build: Cargo workspace + native-common extraction (1/6) [datafusion-java]

via GitHub Fri, 12 Jun 2026 07:41:21 -0700


paleolimbot commented on PR #104:
URL: https://github.com/apache/datafusion-java/pull/104#issuecomment-4692279322

Very cool! I'm happy to review this work because it's very relevant to what
we do (Spark via Sedona Spark and DataFusion via SedonaDB), with the limitation
that I don't have any experience with JNI (just general arrow-over-c). I do
have a few questions about the approach before I get started.

Notably, these are a few ways to avoid maintaining a pile of JNI (by
exploiting other piles of JNI that other people have written already!)

One approach is to use ADBC as the FFI bridge. As a TableProvider author, a
thin layer of code (a few hundred lines) can get you a cdylib with the ADBC
entrypoint defined, and on the Java side there is already a driver manager that
can import the cdylib and "run a SQL query", getting one FFI_ArrowArrayStream
per partition back. You have to figure out how to squeeze your scan options
into the ADBC interface but you're already serializing/deserializing most of
it. It might look like `statement.set_option("my_provider.options",
"<bytes>")`, `statement.set_option("my_provider.projection", "1,2,3,4,")`,
`statement.set_option("my_provider.filter", "<serialized_filter>")`. and
`statement.execute_partitions()` would just take an empty string. You could
also do something fancier with Substrait (ADBC has a
`statement.set_query_substrait()`). There are two pieces of prior art here to
start from (
https://github.com/apache/sedona-db/tree/f9a9227cfa41cb71448d28c21330996fb657a3c1/rust/sedona-ad
bc/src , https://github.com/adbc-drivers/datafusion ).

Another approach is to extract the FFI of the table provider you actually
need and add it to `datafusion-ffi`. You would still need some JNI on the Java
side to import the entrypoints you've defined but because the surface is
smaller than ADBC it's potentially less work. This has the advantage that you
can shape it around the TableProvider more explicitly and iterate. It is also
closer to the shape that Comet will need (cdylib + entrypoint serialized to
protobuf), although there are probably a few shapes that would work there.

At a high level I think my complaint about the approach in this stack is
that you are proposing to build a cdylib that exports JNI entrypoints, whereas
a cleaner approach (in my opinion) is to build a cdylib that exports
entrypoints that just use the Arrow C Data/Stream interface and C types. That
also has broader applicability to non-Java (i.e., can live in datafusion proper
and get eyes/reviews from a wider audience).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] build: Cargo workspace + native-common extraction (1/6) [datafusion-java]

Reply via email to