timsaucer commented on issue #823:
URL: https://github.com/apache/datafusion-python/issues/823#issuecomment-2341714008

   I've done some additional testing with mixed success.
   
   ## Approach 1: Direct Expose
   
   In this approach we basically just expose a function 
`register_table_provider` that takes an `Arc<dyn TableProvider>`, and we wrap 
that provider in a simple structure we can expose via PyCapsule. 
   
   - Advantages: This is *simple* in terms of `datafusion-python`. It's really 
just a matter of exposing the PyCapsule, and it puts all of the responsibility 
for making it work on the consumer.
   - Disadvantages: This requires not only the datafusion version to match 
exactly between the consumer and `datafusion-python` but they must also use the 
same arrow dependencies *and* the same compiler version to get compatible 
binaries. This means that every version of delta-rs would have to update in 
lockstep with datafusion, including sub-dependencies AND compiler version. 
From a release perspective this means users would have to be specific about 
which versions of datafusion and delta-rs packages they use, leading to a lot 
of difficult edge cases.
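   
   To make the trade-off concrete, here is a minimal std-only sketch of the 
direct expose idea. The `TableProvider` trait, `DemoTable`, `ProviderHandle`, 
`export_provider` and `import_provider` are all hypothetical stand-ins, not the 
real DataFusion or PyO3 APIs; the opaque pointer plays the role of whatever a 
PyCapsule would carry.

```rust
use std::ffi::c_void;
use std::sync::Arc;

// Hypothetical stand-in for DataFusion's `TableProvider` trait.
trait TableProvider: Send + Sync {
    fn table_name(&self) -> String;
}

// Hypothetical provider implemented by the consumer library.
struct DemoTable;

impl TableProvider for DemoTable {
    fn table_name(&self) -> String {
        "demo".to_string()
    }
}

// Hypothetical "simple structure" that would sit inside the capsule.
struct ProviderHandle {
    provider: Arc<dyn TableProvider>,
}

// Producer side: box the handle and hand out an opaque pointer.
fn export_provider(provider: Arc<dyn TableProvider>) -> *mut c_void {
    Box::into_raw(Box::new(ProviderHandle { provider })) as *mut c_void
}

// Consumer side: reinterpret the opaque pointer as the same Rust type.
//
// SAFETY: this is only sound if `ptr` came from `export_provider` and both
// sides were compiled with identical type definitions, dependencies and
// compiler, because Rust does not guarantee a stable layout for
// `Arc<dyn Trait>` or its vtable. That is the coupling described above.
unsafe fn import_provider(ptr: *mut c_void) -> Arc<dyn TableProvider> {
    let handle = unsafe { *Box::from_raw(ptr as *mut ProviderHandle) };
    handle.provider
}
```

   Within a single binary this round trip works, which is why the approach is 
so little code. Nothing in the pointer itself tells the consumer whether the 
other side was built with a compatible toolchain.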
   
   ## Approach 2: Create FFI Table Provider
   
   In this approach we define a true FFI friendly Table Provider. We expose a 
PyCapsule with this table provider.
   
   - Advantages: As long as there have been no changes to the FFI definition, 
*any* version of datafusion will work with *any* version of delta-rs. It also 
opens the door for integrating non-Rust table providers, but I don't know if 
that's really important.
   - Disadvantages: This is a *huge* lift because we need to expose `Session` 
as well, which then exposes a lot more. If we could make some argument for 
only allowing external table providers that do not require the session, that 
would *vastly* simplify the problem, but I do see that delta-rs is using 
things like the runtime environment and the session config.
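   
   As a rough illustration of what "FFI friendly" means here, the sketch below 
uses a `#[repr(C)]` struct of `extern "C"` function pointers plus an opaque 
state pointer. Every name in it (`FfiTableProvider`, `ProviderState`, 
`export_provider`, `read_schema`) is hypothetical, and the schema is passed as 
a comma-separated C string purely to keep the example std-only; a real 
implementation would more likely use the Arrow C Data Interface for the schema.

```rust
use std::ffi::{c_char, c_void, CStr, CString};

// Hypothetical C-compatible provider: only function pointers and an opaque
// pointer cross the boundary, so no Rust-specific layout is shared.
#[repr(C)]
pub struct FfiTableProvider {
    // Returns a NUL-terminated schema string; free it with `free_string`.
    pub schema: unsafe extern "C" fn(provider: *const FfiTableProvider) -> *mut c_char,
    // Frees a string previously returned by `schema`.
    pub free_string: unsafe extern "C" fn(s: *mut c_char),
    // Releases the producer-owned state behind `private_data`.
    pub release: unsafe extern "C" fn(provider: *mut FfiTableProvider),
    pub private_data: *mut c_void,
}

// Producer-private state; the consumer never sees this type.
struct ProviderState {
    columns: Vec<String>,
}

unsafe extern "C" fn schema_impl(provider: *const FfiTableProvider) -> *mut c_char {
    let state = unsafe { &*((*provider).private_data as *const ProviderState) };
    // Fall back to an empty string rather than panicking across the boundary.
    CString::new(state.columns.join(","))
        .unwrap_or_default()
        .into_raw()
}

unsafe extern "C" fn free_string_impl(s: *mut c_char) {
    if !s.is_null() {
        drop(unsafe { CString::from_raw(s) });
    }
}

unsafe extern "C" fn release_impl(provider: *mut FfiTableProvider) {
    let p = unsafe { &mut *provider };
    if !p.private_data.is_null() {
        drop(unsafe { Box::from_raw(p.private_data as *mut ProviderState) });
        p.private_data = std::ptr::null_mut();
    }
}

// Producer side: build the struct that would be placed in the PyCapsule.
pub fn export_provider(columns: Vec<String>) -> FfiTableProvider {
    FfiTableProvider {
        schema: schema_impl,
        free_string: free_string_impl,
        release: release_impl,
        private_data: Box::into_raw(Box::new(ProviderState { columns })) as *mut c_void,
    }
}

// Consumer side: uses only the C-compatible surface of the struct.
pub fn read_schema(provider: &FfiTableProvider) -> String {
    unsafe {
        let raw = (provider.schema)(provider);
        let owned = CStr::from_ptr(raw).to_string_lossy().into_owned();
        (provider.free_string)(raw);
        owned
    }
}
```

   The sketch only covers the schema call. Each further `TableProvider` method 
would need its own function pointer and C-compatible argument types, which is 
where the `Session` dependency makes the surface grow.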
   
   ## Evaluation
   
   I have each of these approaches working in a minimal fashion.
   
   For the direct expose approach, it is working except for a failure when 
trying to do a `show()` on the data frame. Oddly, I can execute the dataframe 
and do `count()` and other operations. Some dependency along the line is 
causing a fault during projection that I'm struggling to troubleshoot.
   
   For the FFI table provider, I've got the round trip working: we can get the 
schema from the table provider through FFI. I intentionally built both sides 
in different compiler modes to ensure the internal representations differed. 
The part I'm stuck on here is how much would have to be exposed to get all 
required functions of `TableProvider` working.
   
   I'm open to thoughts and suggestions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

