dabla commented on PR #62867: URL: https://github.com/apache/airflow/pull/62867#issuecomment-4184029714
> Also, I have a question: why are DataFusion-related changes required in providers that have no need to know about DataFusion? As I mentioned earlier, there was an initial idea to introduce a separate provider, something like apache-airflow-providers-apache-datafusion.
>
> DataFusion isn't limited to object storage support alone. There are also table provider capabilities to consider. Currently, DataFusion supports systems like Iceberg, Delta Lake, and Hudi. I've also been involved in discussions (https://github.com/datafusion-contrib/datafusion-table-providers) about making these functionalities more directly part of the DataFusion provider, so that everything works more seamlessly and delivers better performance. They currently don't have Python wrappers, but eventually I'll get it there.

If a separate provider were introduced for DataFusion, I would fully agree with your approach. The main concern right now is that the DataFusion implementation imports the Google hook. This effectively makes common-sql depend on both DataFusion and the Google provider, which feels like an unnecessary coupling. Without that import, the current approach would be fine. In its current form, however, it introduces a cross-provider dependency that we should probably avoid.

I believe the best solution would be to create a dedicated DataFusion provider. That way, it can explicitly depend on both DataFusion itself and the Google provider (for the hook), while keeping common-sql free of unintended external dependencies.

But maybe I'm wrong, so I'm curious what others think of this as well.
