timsaucer commented on issue #823: URL: https://github.com/apache/datafusion-python/issues/823#issuecomment-2341714008
I've done some additional testing with mixed success.

## Approach 1: Direct Expose

In this approach we basically just expose a function `register_table_provider` that takes an `Arc<dyn TableProvider>` and put this in a simple structure we can expose via PyCapsule.

- Advantages: This is *simple* in terms of `datafusion-python`. It's really a matter of exposing the PyCapsule, and it puts all of the responsibility for making it work on the consumer.
- Disadvantages: This requires not only the datafusion version to match exactly between the consumer and `datafusion-python`, but they must also use the same arrow dependencies *and* the same compiler version to get compatible binaries. This means that every version of delta-rs would have to update in lockstep with datafusion, including sub-dependencies AND compiler version. From a release perspective this means users would have to be specific about which versions of the datafusion and delta-rs packages they use, leading to a lot of difficult edge cases.

## Approach 2: Create FFI Table Provider

In this approach we define a true FFI-friendly Table Provider. We expose a PyCapsule with this table provider.

- Advantages: As long as there have been no changes to the FFI definition, *any* version of datafusion will work with *any* version of delta-rs. Also, it opens the door for integrating non-Rust table providers, but I don't know if that's really important.
- Disadvantages: This is a *huge* lift because we need to expose `Session` as well, which then exposes a lot more. Now, if we could make some argument about only allowing external table providers that do not require the session, that would *vastly* simplify the problem, but I do see that delta-rs is using things like the runtime environment and the session config.

## Evaluation

I have each of these approaches working in a minimal fashion.

For the direct expose, it is working except that I have an odd failure when trying to do a `show()` on the data frame. Oddly, I can execute the dataframe and do `count()` and other operations. There is some odd dependency along the line that is causing a fault during projection that I'm struggling to troubleshoot.

For the FFI table provider, I've got the round trip working where we can get the schema from the table provider through FFI, and I intentionally built the two sides in different compiler modes to ensure the internal representations differed. The part I'm stuck on here is how much would have to be exposed to get all required functions of `TableProvider` working.

I'm open to thoughts and suggestions.
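
To make the Approach 2 idea a bit more concrete, here is a minimal sketch of passing a C-layout struct through a PyCapsule, written with `ctypes` so it runs on CPython without any Rust build. Everything in it is an assumption for illustration: the struct `FFITableProviderSketch`, its fields, the capsule name, and the `read_num_columns` helper are hypothetical and are not the datafusion FFI definition. The point is only that the consumer depends on a fixed layout plus a version check rather than on matching Rust compiler versions.

```python
import ctypes

# Hypothetical C-ABI callback type: takes an opaque pointer, returns an int32.
GET_NUM_COLUMNS = ctypes.CFUNCTYPE(ctypes.c_int32, ctypes.c_void_p)


class FFITableProviderSketch(ctypes.Structure):
    """Hypothetical FFI-stable layout (NOT the real datafusion definition)."""

    _fields_ = [
        ("abi_version", ctypes.c_uint32),   # bumped when the layout changes
        ("num_columns", GET_NUM_COLUMNS),   # function pointer, C calling convention
        ("private_data", ctypes.c_void_p),  # opaque producer-owned state
    ]


# The capsule stores a pointer to the name, so keep it alive at module level.
CAPSULE_NAME = b"ffi_table_provider_sketch"

_capsule_new = ctypes.pythonapi.PyCapsule_New
_capsule_new.restype = ctypes.py_object
_capsule_new.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

_capsule_get = ctypes.pythonapi.PyCapsule_GetPointer
_capsule_get.restype = ctypes.c_void_p
_capsule_get.argtypes = [ctypes.py_object, ctypes.c_char_p]


# --- Producer side (stands in for the library that owns the table) ---
def _num_columns(_private_data):
    return 3  # made-up value for the sketch


_callback = GET_NUM_COLUMNS(_num_columns)  # must outlive the capsule
provider = FFITableProviderSketch(
    abi_version=1, num_columns=_callback, private_data=None
)
# No destructor is registered (NULL); the producer keeps `provider` alive here.
capsule = _capsule_new(ctypes.addressof(provider), CAPSULE_NAME, None)


# --- Consumer side (stands in for the library receiving the capsule) ---
def read_num_columns(capsule, expected_version=1):
    """Unwrap the capsule, check the layout version, call through the pointer."""
    address = _capsule_get(capsule, CAPSULE_NAME)
    view = FFITableProviderSketch.from_address(address)
    if view.abi_version != expected_version:
        raise RuntimeError("incompatible FFI definition")
    return view.num_columns(view.private_data)


print(read_num_columns(capsule))  # prints 3
```

A real table provider would need far more than one callback (schema, scan, and whatever session state the scan requires), which is exactly the part that makes this approach a large lift. Approach 1 would instead put a pointer to a Rust `Arc<dyn TableProvider>` in the capsule, which only works when both sides agree on the Rust-internal layout.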
