potiuk commented on PR #25780: URL: https://github.com/apache/airflow/pull/25780#issuecomment-1221510678
> The problem with `PythonPreexistingVirtualenvOperator` is it ideologically misleads the user to think it can only reference a virtual environment, while functionally it can work against much more (for example, you can set `python="/usr/bin/python"` to use system packages installed by apt, even if Airflow is not installed against that Python interpreter). This leads to worse discoverability. Why I like the PythonPreexistingVirtualenOperator name @uranusjr ? I thought a lot about it and it was clear to me that it is better name at the moment I attempted to write the documentation. I think it is really the matter of what we want to promote, not what it CAN do and how it refers to the currently existing operator names that people are used to and are using already. I think "external Python" is far less discoverable if you consider what we want our users to do. First of all I think we want to promote virtualenv usage. As you well know, this is something also `pip` maintainer promote heavily - that pretty much any serious use of multiple python environments should be via virtualenv. This has been a hot topic in discussion I was participating at `pip` and it si very clear that using "a python binary" which is not part of virtualenv is generally dangerous. It's fine if you want to build your "base" image. But we are talking about multiple environments in one environment (whether it's container image or local installation) virtualenv is the way to go. While it is ok to have "base" python installation as just system-installed Python, when it comes to multiple environments set in the same env - setting up virtualenv for those is pretty much, the only option. And definitely one that we should promote to the users. The fact that you CAN do something does not mean that you SHOULD. And in this case I think even if the users can use any non-virtua lenv Python binary, I think they should not. And the name is a strong indication they should not. But there is another aspect as well. And I realised that when I attempted to write the documentation. If you look at t the "Best Practices" document I updated you will - I hope - immediately see why the name is better. Airflow users are already used to Python, PythonVirtualenv, Docker, Kubernetes. And the PythonPreexistingVirtualenvOperator is so much closer to PythonVirtualenv than PythonOperator. For example the callable needs to be serializable, depending on the installed packages different context variables are passed, you can use `dill` or `pickle`. This operator is ALMOST the same from the behaviour as PythonVirtualenvOperator. The only difference is the startup/overhead and the way how you either specify the dependencies or make them embedded. The users have already a mental model built for how they can use airflow and it's better if we fit-in the exisitng mental model rather than try to crate a completely new one. And if you look at the Best Practices docment - it feels super-natural. I wrote a chapter about managing dependencies, where the user can naturally progress from Python through PythonVirtualenv, to PythonPreexisitngVirtualenv. And if you see the line of thoughts there - it gradually changes the pros/cons. And it fits very well there. I think it fits so well, that it actually finally makes PythonVirtualenvOperator actually useful. This is something that @jedcunningham mentioned - that previously that operator was not very useful, and I agree. However when we introduce the new operator, it suddenly changes: I wrote a separate chapter about it - so you can read it in the PR but out-of-the sudden PythonVirtualenvOperator + PythonPreexistingVirtualenvOperator combo is a super useful pair of operators. One of the big drawbacks of the PythonPreexistingVirtualenvOperator. is that you have to have the venv prepared for you. And often as a data scientist or even data engineer, you will not be able to add your venv to the image or local installation because it is managed by others. But you really want to iterate on your tasks with new dependencies. The PythonPreexistingVirtualenvOperator makes it super hard. But on the other hand PythonVirtualenvOperator makes it super easy (and you do not need to ask anyone for permission). And in the new taskflow world, this will be just a matter of replacing your @task.preexisting_virtualenv with @task_virtualenv decorator (with new requirements) and you can test your new dependencies without changing a single line of code in your callable. This is super p owerful and it is I think the most important reason why it should be super-obvious from the names of those two operators that they are almost the same. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
