jdye64 commented on issue #173: URL: https://github.com/apache/arrow-ballista/issues/173#issuecomment-1387195133
Yes, @adriangb is right. Much of the pain comes from trying to serialize and execute Python code on remote nodes when that code has dependencies; this has been true since the days of Hive UDFs years ago. The Python ecosystem as a whole relies heavily on third-party dependencies, so if we can come up with a straightforward way to ensure that every executor has a valid virtual environment with all of the dependencies required by the UDF installed, we should be in good shape. This is the approach we take in some parts of Dask, for example. So maybe as part of Python UDF registration we require a "list" of dependencies that the UDF needs. When the executor server starts up it could create that virtual environment, through pip or conda or whatever is chosen, and install those dependencies. Think of it as an executor server bootstrapping step. Then when any SQL queries are submitted, the UDF can be serialized and sent to the executor, where it can be executed inside that virtual environment (a rough sketch of this idea is below, after the list).

Couple of thoughts:
- Maybe the information about Python dependencies could live in the "catalog" description space of flight_sql in Ballista?
- I think being able to run Python UDFs with their dependencies is a must; Python UDF support is almost not worth having if dependencies can't be used. This is just my opinion and not a fact.
- I can remember old versions of Hive requiring the user to manually SSH to each node and install the Python dependencies by hand. It was the quickest route I ever discovered to making enemies on the dev ops team =). I think that path is a non-starter.
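To make the bootstrapping idea concrete, here is a minimal Python sketch. Nothing here is existing Ballista API: `PythonUdf`, `register_python_udf`-style registration carrying a `requirements` list, and `bootstrap_executor_env` are all hypothetical names used only to illustrate the flow (register UDF with declared dependencies, executor builds a virtual environment at startup, serialized UDFs later run inside it).

```python
import subprocess
from dataclasses import dataclass, field
from typing import Callable, List
import venv


@dataclass
class PythonUdf:
    """Hypothetical UDF registration: the function plus the pip
    requirements it needs on every executor."""
    name: str
    func: Callable
    requirements: List[str] = field(default_factory=list)


def bootstrap_executor_env(env_dir: str, requirements: List[str]) -> str:
    """Sketch of an executor-startup step: create (or reuse) a virtual
    environment and install the UDF's declared dependencies into it.
    Returns the path to the environment's Python interpreter
    (Unix-style layout assumed)."""
    venv.EnvBuilder(with_pip=True, clear=False).create(env_dir)
    python = f"{env_dir}/bin/python"
    if requirements:
        subprocess.check_call([python, "-m", "pip", "install", *requirements])
    return python


if __name__ == "__main__":
    # Registration declares what the UDF needs; the executor installs it
    # once at startup, before any serialized UDFs arrive for execution.
    udf = PythonUdf(
        name="parse_user_agent",
        func=lambda ua: ua.split("/")[0],
        requirements=["pandas>=1.5", "user-agents"],  # example deps only
    )
    python_in_env = bootstrap_executor_env("/tmp/ballista-udf-env", udf.requirements)
    print("executor env ready:", python_in_env)
```

In a real implementation the dependency list would travel with the UDF registration (or live in the catalog, as suggested above) rather than being hard-coded, and the executor would likely cache environments keyed by their requirement set.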
