gaojun2048 commented on pull request #1881: URL: https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1057712514
> I feel ideally they should use the same programing interface (SQL or DataFrame), DataFusion provide computation on a single node and Ballista add a distributed layer. With this assumption, DF is the compute core wouldn't it make sense to have udf support in DF?

I am not sure whether my understanding is wrong, but I have always thought of DF as just a computing library, not something deployed directly in production. Those who use DF take it as a dependency of their project and develop their own computing engine on top of it. For example, Ballista is a distributed computing engine built on DF. Ballista is a mature computing engine, just like Presto or Spark: people who use Ballista only need to download and deploy it to their machines to start the Ballista service. They rarely care about how Ballista is implemented, so a udf plugin that supports dynamic loading lets these people define their own udfs without modifying Ballista's source code.

```
I feel ideally they should use the same programing interface (SQL or DataFrame), DataFusion provide computation on a single node and Ballista add a distributed layer. With this assumption, DF is the compute core wouldn't it make sense to have udf support in DF?
```

Yes, it is important and required for DF to support udfs. But it is not necessary for DF itself to support a `udf plugin` that loads udfs dynamically, because people who use DF as a dependency to build their own compute engine, such as Ballista, can manage udfs themselves. Imagine that Ballista and DF were not in the same repository but were two separate projects. As a Ballista developer who needs to add a udf for a special analysis need, what I am most likely to do is manage the udf myself: either write its implementation directly in the Ballista crate, or add a `udf plugin` to Ballista, as this PR does, that dynamically loads udfs developed by Ballista users (not Ballista developers). I then decide when to call DF's `register_udf` method to register these udfs in the `ExecutionContext` so that DF can use them in its computations.

Of course, we could put the udf plugin directly into DF, but this feature is not necessary for DF, and doing so would make the `register_udf` method look redundant and the design of DF's udf support harder to understand.

So I would say that the people who need the `udf plugin` most are those who use Ballista as a full-fledged computing engine: they just download and deploy Ballista. They do not modify the source code of Ballista or DF, because doing so would require a deeper understanding of both, and once the source code is modified, they would have to invest extra effort in merging and rebuilding every time they upgrade Ballista. But today, if a user simply downloads and deploys Ballista, there is no way for them to register their own udfs with DF. The core goal of the udf plugin is to give udfs that were not compiled into the project an opportunity to be discovered and registered in DF.

Finally, if we define Ballista's goal as being a distributed implementation of DataFusion, that is, a library used as a dependency of other projects rather than a distributed computing engine (like Presto or Spark) that can be downloaded, deployed, and used directly, then it seems to me that the udf plugin is not necessary, for the same reason: its core goal is only to let udfs that were not compiled into the project be discovered and registered in DF.
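To make that "discovered and registered" step concrete, here is a rough sketch of how a deployed Ballista service could pick up udfs from shared libraries at startup. This is only an illustration of the idea, not the design of this PR: the `libloading` crate, the plugin directory, and the `register_udfs` entry-point name are all assumptions.

```rust
use datafusion::prelude::ExecutionContext;
use libloading::{Library, Symbol};

/// Hypothetical loader: scan a directory that the operator of a deployed
/// Ballista cluster drops shared libraries into, and let each library
/// register its udfs into the DataFusion context.
unsafe fn load_udf_plugins(
    ctx: &mut ExecutionContext,
    plugin_dir: &str,
) -> Result<(), Box<dyn std::error::Error>> {
    for entry in std::fs::read_dir(plugin_dir)? {
        let path = entry?.path();
        // Load the shared library (.so / .dylib / .dll).
        let lib = Library::new(&path)?;
        // Assumed plugin contract: every plugin exports a `register_udfs`
        // function that registers its udfs into the context it is given.
        // Plain Rust ABI, so plugin and host must be built with the same
        // compiler version.
        let register: Symbol<fn(&mut ExecutionContext)> = lib.get(b"register_udfs")?;
        register(ctx);
        // Keep the library loaded for the lifetime of the process, since
        // the registered udfs point at code inside it.
        std::mem::forget(lib);
    }
    Ok(())
}
```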
Those projects that use Ballista as a dependency can manage their own udfs and decide when to register them into DF.
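For such projects, registration is just an ordinary call into DF's existing udf API, with no plugin machinery required. As a minimal sketch against the DataFusion API of this era (module paths have moved between releases), a hypothetical `pow2` udf could be defined and registered like this:

```rust
use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, Float64Array};
use datafusion::arrow::datatypes::DataType;
use datafusion::logical_plan::create_udf;
use datafusion::physical_plan::functions::{make_scalar_function, Volatility};
use datafusion::prelude::ExecutionContext;

fn register_my_udfs(ctx: &mut ExecutionContext) {
    // `make_scalar_function` adapts a plain arrays-in/array-out closure
    // to DataFusion's ScalarFunctionImplementation.
    let pow2 = make_scalar_function(|args: &[ArrayRef]| {
        let input = args[0]
            .as_any()
            .downcast_ref::<Float64Array>()
            .expect("pow2 expects a Float64 column");
        let result: Float64Array = input.iter().map(|v| v.map(|x| x * x)).collect();
        Ok(Arc::new(result) as ArrayRef)
    });

    let udf = create_udf(
        "pow2",                      // name callable from SQL and DataFrame
        vec![DataType::Float64],     // argument types
        Arc::new(DataType::Float64), // return type
        Volatility::Immutable,
        pow2,
    );

    // The existing hook this comment refers to: the embedding project
    // decides when, and with which udfs, to call it.
    ctx.register_udf(udf);
}
```

After that, a query like `SELECT pow2(c) FROM t` works on any context the host project created, without DF or Ballista knowing about the udf in advance.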
