GitHub user hterik created a discussion: Add a "`DockerExecutor`" inside `CeleryExecutor`
### Description

When writing data pipelines, the DAG code very often depends heavily on Python libraries and other tools installed in the environment. Today, when running Airflow locally with the `CeleryExecutor`, it continuously fetches and reparses DAG code, but it will not reimport any other libraries that the DAG depends on. The DAG code basically runs in whatever environment the executor was first launched in.

It would be very helpful if Airflow could automatically fetch a Docker image and run each task instance inside a fresh container. This is how it works with the `KubernetesExecutor` today, where it greatly simplifies deployment of updated libraries and providers.

I'm aware of `@task.docker`; while it sort of works, it has many rough edges that make it less of a seamless experience compared to how the `KubernetesExecutor` works:

* Imports made outside of the `@task` are not present in the task. Basically the whole task is a different process.
* DAG run context and other things like `xcom_pull` are not accessible.
* Common volumes, environment and other `docker run` arguments need to be set on every task, instead of once per executor instance. Likewise for other essentials like `auto_remove` that should not be task-unique.
* I have had bad experiences with the Jinja template corrupting the temporary Python file, but I can't remember the exact details.
* Error/exception handling is not as smooth. For example, it's not possible to categorize retriable vs expected exceptions (`AirflowFailException` is missing and `AirflowSkipException` needs to be re-modeled as an exit code).
* Logging becomes bound to the task and not the executor.

Some of the hurdles can be mitigated with a custom wrapper around `@task.docker` (a sketch of such a wrapper is attached at the end of this post), but it's still not as smooth as it could be. Long story short, it feels like you are working against the system instead of being supported by it. Otherwise I think the idea behind `@task.docker` is good if you need multiple different execution environments per task; it's just that the abstraction hides a bit too much and the integration with the DAG run is missing a few pieces.

We've given up on `@task.docker`. To work around the use case of continuously updating libraries, we've made a background process that regularly updates and restarts the entire Docker image that runs the Airflow Celery executor, just to get the latest environment, and tasks are then run as plain `@task`. This is far from ideal, as it requires every task to complete before the update can happen. It also makes QA deployment of a pipeline much more difficult if you want to dry-run a pipeline with its dependencies before merging it.

(The same question also applies to the `EdgeExecutor`.)

### Use case/motivation

_No response_

### Related issues

_No response_

### Are you willing to submit a PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)

GitHub link: https://github.com/apache/airflow/discussions/58687
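For reference, here is a minimal sketch of the kind of wrapper mentioned above. It only illustrates centralizing the repeated `docker run` arguments; the image name, mount list and environment values are made-up placeholders, and parameter names follow the docker provider's `DockerOperator` and may differ between provider versions (e.g. `auto_remove` is a string in newer releases).

```python
# Sketch only: a shared wrapper so common Docker arguments are declared once,
# assuming apache-airflow-providers-docker is installed.
import functools

from airflow.decorators import task
from docker.types import Mount

# Hypothetical shared defaults that would otherwise be repeated on every task.
COMMON_DOCKER_KWARGS = dict(
    image="my-registry.example.com/pipeline-env:latest",  # placeholder image
    auto_remove="success",  # string form in newer provider versions
    mounts=[Mount(target="/data", source="/data", type="bind")],
    environment={"PIPELINE_ENV": "qa"},
)

# Partial application so each task only supplies what is unique to it.
docker_task = functools.partial(task.docker, **COMMON_DOCKER_KWARGS)


@docker_task()
def transform(path: str) -> str:
    # Runs inside a fresh container; imports must happen inside the function,
    # and DAG run context / XCom access is limited, which is part of the
    # friction described above.
    import json

    with open(path) as f:
        return json.dumps({"rows": sum(1 for _ in f)})
```

This reduces the boilerplate per task, but it does not address the context/XCom, exception-mapping or logging points, which is why an executor-level solution would be preferable.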
