paleolimbot commented on PR #104:
URL: https://github.com/apache/datafusion-java/pull/104#issuecomment-4692279322

   Very cool! I'm happy to review this work because it's very relevant to what 
we do (Spark via Sedona Spark and DataFusion via SedonaDB), with the limitation 
that I don't have any experience with JNI (just general arrow-over-c). I do 
have a few questions about the approach before I get started.
   
   Notably, these are a few ways to avoid maintaining a pile of JNI (by 
exploiting other piles of JNI that other people have written already!)
   
   One approach is to use ADBC as the FFI bridge. As a TableProvider author, a 
thin layer of code (a few hundred lines) can get you a cdylib with the ADBC 
entrypoint defined, and on the Java side there is already a driver manager that 
can import the cdylib and "run a SQL query", getting one FFI_ArrowArrayStream 
per partition back. You have to figure out how to squeeze your scan options 
into the ADBC interface but you're already serializing/deserializing most of 
it. It might look like `statement.set_option("my_provider.options", 
"<bytes>")`, `statement.set_option("my_provider.projection", "1,2,3,4,")`, 
`statement.set_option("my_provider.filter", "<serialized_filter>")`. and 
`statement.execute_partitions()` would just take an empty string. You could 
also do something fancier with Substrait (ADBC has a 
`statement.set_query_substrait()`). There are two pieces of prior art here to 
start from ( 
https://github.com/apache/sedona-db/tree/f9a9227cfa41cb71448d28c21330996fb657a3c1/rust/sedona-ad
 bc/src , https://github.com/adbc-drivers/datafusion ).
   
   Another approach is to extract the FFI of the table provider you actually 
need and add it to `datafusion-ffi`. You would still need some JNI on the Java 
side to import the entrypoints you've defined but because the surface is 
smaller than ADBC it's potentially less work. This has the advantage that you 
can shape it around the TableProvider more explicitly and iterate. It is also 
closer to the shape that Comet will need (cdylib + entrypoint serialized to 
protobuf), although there are probably a few shapes that would work there.
   
   At a high level I think my complaint about the approach in this stack is 
that you are proposing to build a cdylib that exports JNI entrypoints, whereas 
a cleaner approach (in my opinion) is to build a cdylib that exports 
entrypoints that just use the Arrow C Data/Stream interface and C types. That 
also has broader applicability to non-Java (i.e., can live in datafusion proper 
and get eyes/reviews from a wider audience).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to