potiuk commented on PR #25780:
URL: https://github.com/apache/airflow/pull/25780#issuecomment-1221510678

   > The problem with `PythonPreexistingVirtualenvOperator` is it ideologically 
misleads the user to think it can only reference a virtual environment, while 
functionally it can work against much more (for example, you can set 
`python="/usr/bin/python"` to use system packages installed by apt, even if 
Airflow is not installed against that Python interpreter). This leads to worse 
discoverability.
   
   Why I like the PythonPreexistingVirtualenOperator name @uranusjr  ? I 
thought a lot about it and it was clear to me that it is better name at the 
moment I attempted to write the documentation.
   
   I think it is really the matter of what we want to promote, not what it CAN 
do and how it refers to the currently existing operator names that people are 
used to and are using already. I think "external Python" is far less 
discoverable if you consider what we want our users to do.
   
   First of all I think we want to promote virtualenv usage. As you well know, 
this is something also `pip` maintainer promote heavily - that pretty much any 
serious use of multiple python environments should be via virtualenv. This has 
been a hot topic in discussion I was participating at `pip` and it si very 
clear that using "a python binary" which is not part of virtualenv  is 
generally dangerous. It's fine if you want to build your "base" image. But we 
are talking about multiple environments in one environment (whether it's 
container image or local installation) virtualenv is the way to go. While it is 
ok to have "base" python installation as just system-installed Python, when it 
comes to multiple environments set in the same env - setting up virtualenv for 
those is pretty much, the only option. And definitely one that we should 
promote to the users. The fact that you CAN do something does not mean that you 
SHOULD. And in this case I think even if the users can use any non-virtua
 lenv Python binary, I think they should not. And the name is a strong 
indication they should not.
   
   But there is another aspect as well. And I realised that when I attempted to 
write the documentation. If you look at t the "Best Practices" document I 
updated you will - I hope - immediately see why the name is better. Airflow 
users are already used to Python, PythonVirtualenv, Docker, Kubernetes.  And 
the PythonPreexistingVirtualenvOperator is so much closer to PythonVirtualenv 
than PythonOperator. For example the callable needs to be serializable, 
depending on the installed packages different context variables are passed, you 
can use `dill` or `pickle`.  This operator is ALMOST the same from the 
behaviour as PythonVirtualenvOperator. The only difference is the 
startup/overhead and the way how you either specify the dependencies or make 
them embedded.
   
   The users have already a mental model built for how they can use airflow and 
it's better if we fit-in the exisitng mental model rather than try to crate a 
completely new one. 
   
   And if you look at the Best Practices docment - it feels super-natural. I 
wrote a chapter about managing dependencies, where the user can naturally 
progress from Python through PythonVirtualenv, to PythonPreexisitngVirtualenv. 
And if you see the line of thoughts there - it gradually changes the pros/cons. 
And it fits very well there. 
   
   I think it fits so well, that it actually finally makes 
PythonVirtualenvOperator actually useful. This is something that @jedcunningham 
mentioned - that previously that operator was not very useful, and I agree. 
However when we introduce the new operator, it suddenly changes: 
   
   I wrote a separate chapter about it - so you can read it in the PR but 
out-of-the sudden PythonVirtualenvOperator + 
PythonPreexistingVirtualenvOperator combo is a super useful pair of operators. 
One of the big drawbacks of the PythonPreexistingVirtualenvOperator. is that 
you have to have the venv prepared for you. And often as a data scientist or 
even data engineer, you will not be able to add your venv to the image or local 
installation because it is managed by others. But you really want to iterate on 
your tasks with new dependencies. The PythonPreexistingVirtualenvOperator makes 
it super hard. But on the other hand PythonVirtualenvOperator makes it super 
easy (and you do not need to ask anyone for permission). And in the new 
taskflow world, this will be just a matter of replacing your 
@task.preexisting_virtualenv with @task_virtualenv decorator (with new 
requirements) and you can test your new dependencies without changing a single 
line of code in your callable. This is super p
 owerful and it is I think the most important reason why it should be 
super-obvious from the names of those two operators that they are almost the 
same.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to