uranusjr commented on issue #15286: URL: https://github.com/apache/airflow/issues/15286#issuecomment-818866373
I thought about this a bit and feel there are two things to consider here. The first is the overhead for `PythonVirtualenvOperator` to populate the virtual environment, which (as mentioned above) should be solved by introducing some caching mechanism, similar to how CI caches things between runs. This is very much worth doing.

There is another use case surrounding `PythonVirtualenvOperator`, however: people wanting more control over the environment used to run Python code. Maybe there are some dependencies that can’t be covered by Python packaging, or that require special configuration of the environment. Or maybe the user is simply migrating from an existing cron setup and wants to reuse the existing environments first, to avoid rewriting everything all at once. Currently people need to “drop down” to `BashOperator` to achieve this, and while that definitely works, it kind of “wastes” the knowledge that the operator is running Python, and prevents the nice things we could do with that knowledge.

I think two solutions are needed for the two problems. The first is probably more intuitive to design: we can add caching options to `PythonVirtualenvOperator` to make Airflow cache and reuse the environment (or a subset of it), and we can steal some ideas from CI designs for this.

The other is less straightforward. My current idea is to introduce an `ExternalPythonOperator` (please recommend better names) that, instead of taking requirements to create a virtual environment from, simply takes a path to a Python executable to run the Python callable with. The behaviour would otherwise be very similar to `PythonVirtualenvOperator`, including all the code generation and pickling caveats. This would be much easier to implement than the caching one (which, as also mentioned above, requires tricky considerations around parallelism), so I’ll probably start with it and see what I can do. Any advice is very welcome!

-- This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
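
For illustration only, here is a rough sketch of the mechanism the comment describes for the proposed `ExternalPythonOperator`: generate a small script around the callable's source, hand it to a given Python executable, and pass arguments and the result through pickle files. This is not Airflow code. The helper name `run_with_external_python`, the runner template, and the file names are all hypothetical, and the sketch takes the callable's source as a string rather than extracting it from a function object.

```python
import pickle
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical generated script: defines the callable, loads pickled
# arguments, calls it, and pickles the return value for the parent process.
_RUNNER_TEMPLATE = """\
import pickle
import sys

{source}

with open(sys.argv[1], "rb") as f:
    args, kwargs = pickle.load(f)
result = {name}(*args, **kwargs)
with open(sys.argv[2], "wb") as f:
    pickle.dump(result, f)
"""


def run_with_external_python(python, source, name, args=(), kwargs=None):
    """Run the function `name` defined in `source` under the interpreter at `python`.

    Assumption: arguments and the result are picklable in both interpreters,
    and the pickle protocol used here is readable by the external one.
    """
    script = _RUNNER_TEMPLATE.format(source=textwrap.dedent(source), name=name)
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        script_file = tmp_path / "script.py"
        input_file = tmp_path / "input.pkl"
        output_file = tmp_path / "output.pkl"

        script_file.write_text(script, encoding="utf-8")
        input_file.write_bytes(pickle.dumps((tuple(args), kwargs or {})))

        # check=True raises CalledProcessError if the external process fails.
        subprocess.run(
            [str(python), str(script_file), str(input_file), str(output_file)],
            check=True,
        )
        return pickle.loads(output_file.read_bytes())


if __name__ == "__main__":
    SOURCE = """
    def add(a, b):
        return a + b
    """
    # sys.executable stands in for the path to a pre-existing environment's Python.
    print(run_with_external_python(sys.executable, SOURCE, "add", args=(1, 2)))
```

In practice the `python` argument would point at the interpreter of an already prepared environment, such as one reused from an existing cron setup. Because the callable only travels as source text, it cannot rely on names from the calling module, which is one form of the code generation and pickling caveats mentioned above.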
