gaojun2048 commented on pull request #1881:
URL: 
https://github.com/apache/arrow-datafusion/pull/1881#issuecomment-1057712514


   > I feel ideally they should use the same programing interface (SQL or 
DataFrame), DataFusion provide computation on a single node and Ballista add a 
distributed layer. With this assumption, DF is the compute core wouldn't it 
make sense to have udf support in DF?
   
   My understanding may be wrong, but I have always thought of DF as just a 
computing library, which cannot be deployed to production directly. Those who 
use DF take it as a project dependency and develop their own computing engine 
on top of it. For example, Ballista is a distributed computing engine built on 
DF. Ballista is a mature computing engine, just like Presto/Spark: people who 
use Ballista only need to download and deploy it to their machines to start 
the Ballista service. They rarely care how Ballista is implemented, so a 
`udf plugin` that supports dynamic loading lets these people define their own 
UDFs without modifying Ballista's source code.
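   To make the plugin idea above concrete, here is a minimal sketch of such a 
boundary in Rust. Everything in it is hypothetical (the `UdfPlugin` trait, the 
plugin, and the loader are illustrations, not this PR's actual code), UDFs are 
reduced to `f64 -> f64` instead of Arrow arrays, and a real implementation 
would load a shared library at runtime (e.g. with `libloading`) rather than 
using the hard-coded stand-in shown here:

   ```rust
   use std::collections::HashMap;

   // A scalar UDF reduced to f64 -> f64 for illustration; a real engine
   // would operate on Arrow arrays instead. All names here are hypothetical.
   type ScalarUdf = fn(f64) -> f64;

   // The contract a UDF plugin implements. In a real dynamic-loading setup
   // this trait would live in a shared crate compiled into both the engine
   // and the plugin's `cdylib`.
   trait UdfPlugin {
       // Called by the engine so the plugin can add its UDFs without the
       // engine knowing about them at compile time.
       fn register(&self, registry: &mut HashMap<String, ScalarUdf>);
   }

   // A user-written plugin: the engine's source is never modified.
   struct MyAnalyticsPlugin;

   impl UdfPlugin for MyAnalyticsPlugin {
       fn register(&self, registry: &mut HashMap<String, ScalarUdf>) {
           registry.insert("double".to_string(), |x: f64| x * 2.0);
           registry.insert("square".to_string(), |x: f64| x * x);
       }
   }

   // Stand-in for dynamic discovery: a real engine would scan a plugin
   // directory and load each shared library (e.g. with `libloading`),
   // resolving a well-known constructor symbol instead of this direct call.
   fn load_plugins() -> Vec<Box<dyn UdfPlugin>> {
       vec![Box::new(MyAnalyticsPlugin)]
   }

   fn build_registry() -> HashMap<String, ScalarUdf> {
       let mut registry = HashMap::new();
       for plugin in load_plugins() {
           plugin.register(&mut registry);
       }
       registry
   }

   fn main() {
       let registry = build_registry();
       println!("double(21.0) = {}", registry["double"](21.0));
   }
   ```

   The key property of the sketch is that the engine only depends on the 
`UdfPlugin` trait, never on any concrete plugin type.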
   
   > I feel ideally they should use the same programing interface (SQL or 
DataFrame), DataFusion provide computation on a single node and Ballista add a 
distributed layer. With this assumption, DF is the compute core wouldn't it 
make sense to have udf support in DF?
   
   Yes, it is important and required for DF to support UDFs. But it is not 
necessary for DF itself to support a `udf plugin` that loads UDFs dynamically, 
because of how people use DF as a dependency to develop their own computing 
engine, such as Ballista. Imagine that Ballista and DF were not in the same 
repository but were two separate projects. As a Ballista developer, I need to 
add my own UDFs to meet my special analysis needs. What I'm most likely to do 
is manage the UDFs myself: either write the UDF implementations directly in 
the Ballista crate, or add a `udf plugin` to Ballista, as this PR does, which 
supports dynamic loading of UDFs developed by Ballista users (not Ballista 
developers). Then I decide when to call DF's `register_udf` method to register 
these UDFs in the `ExecutionContext` so that DF can use them for computation. 
Of course, we could put the UDF plugin directly in DF, but that feature is not 
necessary for DF, and doing so would make the `register_udf` method look 
redundant and make the design of DF's UDF support harder to understand.
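   To illustrate the registration flow described above, here is a purely 
illustrative sketch of a context with a `register_udf` method. The `Context` 
type and its methods are hypothetical stand-ins (loosely inspired by 
DataFusion's `ExecutionContext::register_udf`, not its real API); the point is 
only that `register_udf` stays the single integration point, whether a UDF was 
compiled into the engine or discovered by a plugin loader:

   ```rust
   use std::collections::HashMap;

   // Hypothetical stand-in for an execution context; not DataFusion's API.
   struct Context {
       udfs: HashMap<String, fn(f64) -> f64>,
   }

   impl Context {
       fn new() -> Self {
           Context { udfs: HashMap::new() }
       }

       // Single integration point: a compiled-in UDF and a plugin-loaded
       // UDF both enter the engine through this method.
       fn register_udf(&mut self, name: &str, f: fn(f64) -> f64) {
           self.udfs.insert(name.to_string(), f);
       }

       fn call(&self, name: &str, arg: f64) -> Option<f64> {
           self.udfs.get(name).map(|f| f(arg))
       }
   }

   fn main() {
       let mut ctx = Context::new();
       // The host engine (e.g. Ballista) decides when this happens:
       // typically once at startup, before any query runs.
       ctx.register_udf("add_one", |x: f64| x + 1.0);
       println!("add_one(41.0) = {:?}", ctx.call("add_one", 41.0));
   }
   ```

   In this picture, whether the name/function pairs come from the engine's own 
crate or from a plugin directory is a decision the host engine makes, not DF.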
   
   So I would say that the people who need the `udf plugin` most are those who 
use Ballista as a full-fledged computing engine: they just download and deploy 
Ballista. They don't modify the source code of Ballista or DF, because doing 
so would require a deeper understanding of both, and once the source code is 
modified, they must invest extra cost to merge and rebuild every time they 
upgrade Ballista. But today, a user who simply downloads and deploys Ballista 
has no way to register their UDFs into DF. The core goal of the `udf plugin` 
is to give UDFs that were not compiled into the project a chance to be 
discovered and registered in DF.
   
   Finally, if we define Ballista's goal as a distributed implementation of 
DataFusion, i.e. a library meant to be used as a dependency of other projects, 
rather than a distributed computing engine (like Presto/Spark) that can be 
downloaded, deployed, and used directly, then the `udf plugin` does not seem 
necessary to me, because its core goal is to let UDFs that were not compiled 
into the project be discovered and registered in DF. Projects that use 
Ballista as a dependency can manage their own UDFs and decide when to register 
them into DF.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

