mingmwang edited a comment on pull request #2024:
URL: https://github.com/apache/arrow-datafusion/pull/2024#issuecomment-1076250450


   > > Hi, @thinkharderdev
   > > I think this PR doesn't resolve the issue. One of the key problems is that users register UDFs/UDAFs with SessionContext.register_udf(), but how do the SessionContexts (ExecutionContext) in all the Executors learn about the registered UDFs? That information is not propagated to the executor side.
   > > For DataFusion, SessionContext.register_udf() is not a problem. But for Ballista, there are three different kinds of SessionContext:
   > > 
   > > 1. the context created in the Ballista client, which uses the BallistaQueryPlanner to send logical plans to the Scheduler; see the method create_df_ctx_with_ballista_query_planner()
   > > 2. the context created in the Scheduler
   > > 3. the context created in the Executor
   > > 
   > > I'm going to remove the context created in the Executor because I don't think it makes sense for the Executor to hold a global SessionContext.
   > > Hi, @yjshen @alamb
   > > Please share your thoughts.
   > 
   > I think the idea is that there are two ways to register the UDFs/UDAFs in the executors:
   > 
   > 1. Build your own executor where you basically write your own entrypoint and set up the `SessionContext` with any extensions (codecs, udfs, udafs, optimizers, planners, etc.) registered at startup.
   > 2. Use the mechanism being developed in [add udf/udaf plugin #1881](https://github.com/apache/arrow-datafusion/pull/1881) to load udfs/udafs as plugins, which are placed in a plugin directory and dynamically linked into the out-of-the-box scheduler/executor binary.
   > 
   > Based on the discussion in that PR, we were thinking that the plugin mechanism could use the `SessionContext` to register plugins at application start, so internally we could use the same mechanism for both approaches. That is, you can either use plugins or roll your own main method, but either way you end up with extensions registered in the `SessionContext`, and we use that for deserializing udfs/udafs in the plan.
   > 
   > Ultimately I think we do need some sort of context in the executor if we 
are going to support all the extension points that DataFusion provides in 
Ballista.
   
   Hi, @thinkharderdev 
   
   I'm going to remove the SessionContext from the Executor code, make the Executor itself implement the `FunctionRegistry` trait, and pass the `Arc<dyn FunctionRegistry>` down to `try_into_physical_plan()` and the other methods that need it.
   The major reason is that the Executor side does not need optimizers or planners; in fact, the Executor should not run any planning logic at all. The planning logic should only run on the Scheduler side. Also, to support multi-tenant SessionContexts, it is awkward to keep a global SessionContext on the Executor side.
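   To make the direction concrete, here is a minimal sketch (not the code from this PR) of an Executor implementing DataFusion's `FunctionRegistry` trait; the struct fields and error messages are illustrative, and the exact import paths vary between DataFusion versions:

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Arc;

use datafusion::error::{DataFusionError, Result};
use datafusion::execution::registry::FunctionRegistry;
use datafusion::physical_plan::udaf::AggregateUDF;
use datafusion::physical_plan::udf::ScalarUDF;

/// Illustrative executor state: the functions known to this Executor,
/// registered at startup (custom entrypoint) or loaded from plugins.
pub struct Executor {
    scalar_functions: HashMap<String, Arc<ScalarUDF>>,
    aggregate_functions: HashMap<String, Arc<AggregateUDF>>,
}

impl FunctionRegistry for Executor {
    fn udfs(&self) -> HashSet<String> {
        self.scalar_functions.keys().cloned().collect()
    }

    fn udf(&self, name: &str) -> Result<Arc<ScalarUDF>> {
        self.scalar_functions.get(name).cloned().ok_or_else(|| {
            DataFusionError::Plan(format!("UDF '{}' is not registered in the executor", name))
        })
    }

    fn udaf(&self, name: &str) -> Result<Arc<AggregateUDF>> {
        self.aggregate_functions.get(name).cloned().ok_or_else(|| {
            DataFusionError::Plan(format!("UDAF '{}' is not registered in the executor", name))
        })
    }
}
```

   Plan deserialization could then accept a `&dyn FunctionRegistry` (or `Arc<dyn FunctionRegistry>`) instead of a full `SessionContext`, so the Executor never needs to carry optimizers or planners.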
   
   For udf/udaf registration on the Executor side, I have not yet had a chance to go through the code in https://github.com/apache/arrow-datafusion/pull/1881. The two approaches you mentioned should both work. To support session-level udfs/udafs (temporary functions), we also need a way to propagate the temporary functions' metadata and library URL or library files from the Scheduler side to the Executor side.
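   As a purely hypothetical illustration of that last point, the Scheduler might ship something like the following alongside the serialized plan so that Executors can load session-level functions on demand; none of these field names exist in Ballista today:

```rust
/// Hypothetical metadata for a session-scoped (temporary) UDF/UDAF that the
/// Scheduler would propagate to Executors; field names are illustrative only.
#[derive(Clone, Debug)]
pub struct TempFunctionMeta {
    /// The session the temporary function belongs to.
    pub session_id: String,
    /// Function name as it appears in the serialized plan.
    pub name: String,
    /// Where the Executor can fetch the library implementing the function,
    /// e.g. an object-store URL or a path already present on the Executor.
    pub library_uri: String,
    /// Symbol to resolve when dynamically loading the library as a plugin.
    pub entry_symbol: String,
}
```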
   

