[GitHub] [arrow-ballista] adriangb commented on issue #173: Add support for Python UDFs in distributed queries

GitBox Wed, 18 Jan 2023 13:55:08 -0800


adriangb commented on issue #173:
URL: https://github.com/apache/arrow-ballista/issues/173#issuecomment-1396138349


   > Where would the HTTP server be hosted? Scheduler? Single Executor process? 
Multiple Executor processes? An entirely new process?
   
   I'd think it would be very similar to an executor process (a "UDF executor" 
process?), but I'm not very familiar with how executors run today or how they 
scale.
   
   > Wouldn't it make more sense to have that communication channel be Arrow 
flight instead of HTTP?
   
   In the example above it could be Arrow Flight or any other communication 
protocol; I'm not constraining it to HTTP.
   It would be really cool, though, if as a user I could deploy one thing that 
can talk over Arrow Flight for use as a UDF and also serve an HTTP API.
   
   > Seems like this would introduce a large performance hit having to make an 
external (data movement) remote invocation for each RecordBatch since the data 
would need to be moved to the HTTP UDF service running on another host.
   
   I'm not sure about this one. Would the data currently reside in the 
Executor? If so, the only way to do everything in memory would be to run these 
UDFs on the Executor itself, which gets into the complication of dependency 
management. Could users wrap the executor itself? Maybe when `UDFExecutorApp` 
registers itself with the scheduler it _becomes_ an executor that executes 
regular queries as well as the UDFs it is running locally? It might get 
confusing if different executors have different UDFs available...
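
   To make the "different executors have different UDFs" concern concrete, 
here is a rough sketch of how a scheduler could track which UDFs each executor 
advertises and only pick executors that have everything a task needs. This is a 
toy model, not Ballista's actual API: `Executor`, `Scheduler`, `register` and 
`eligible_executors` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Executor:
    """Hypothetical executor that advertises the UDFs it can run locally."""

    executor_id: str
    udfs: set[str] = field(default_factory=set)


class Scheduler:
    """Toy scheduler for illustration only (not Ballista's scheduler)."""

    def __init__(self) -> None:
        self.executors: dict[str, Executor] = {}

    def register(self, executor: Executor) -> None:
        # An executor announces itself together with the UDFs it ships.
        self.executors[executor.executor_id] = executor

    def eligible_executors(self, required_udfs: set[str]) -> list[str]:
        # Only executors that have every required UDF can take the task.
        return sorted(
            e.executor_id
            for e in self.executors.values()
            if required_udfs <= e.udfs
        )


scheduler = Scheduler()
scheduler.register(Executor("plain-1"))  # regular executor, no UDFs
scheduler.register(Executor("udf-app-1", {"my_udf"}))
scheduler.register(Executor("udf-app-2", {"my_udf", "other_udf"}))

print(scheduler.eligible_executors({"my_udf"}))
```

   In this sketch a task with no UDFs can run anywhere, while a task that 
needs a UDF is limited to the executors that registered it, which is where 
the scheduling could become uneven.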


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
