potiuk commented on code in PR #25780: URL: https://github.com/apache/airflow/pull/25780#discussion_r956388525
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming. If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade. *Use with caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you can upgrade the providers
+independently and their constraints do not limit you so the chance of a conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write your own Custom Operator,
+you might get to the point where the dependencies required by the custom code of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some limits and overhead), and
+we will gradually go through those strategies that requires some changes in your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created before task is run, and
+  removed after it is finished, so there is nothing special (except having virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the venvs).

Review Comment:
   Right. It was a mental shortcut. I meant that you do not have to run multiple workers to handle multiple environments, so the memory saving comes from not duplicating memory across additional workers. I reworded it.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
