GitHub user hterik created a discussion: Add a "`DockerExecutor`" inside 
`CeleryExecutor`

### Description

When writing data pipelines, the DAG code very often depends heavily on 
Python libraries and other tools installed in the environment. Today, when 
running Airflow locally with the `CeleryExecutor`, it continuously fetches and 
re-parses DAG code, but it will not re-import any other libraries that the DAG 
depends on. The DAG code effectively runs in whatever environment the executor 
was first launched in. 

It would be very helpful if Airflow could automatically fetch a Docker image 
and run each task instance inside a new container based on it. This is how it 
works with the `KubernetesExecutor` today, where it greatly simplifies 
deployment of updated libraries and providers. 

I'm aware of `@task.docker`; while it sort of works, it has many rough edges 
that make it a less seamless experience than the `KubernetesExecutor`:
* Imports made outside of the `@task` are not present in the task; effectively 
the whole task runs in a separate process. 
* The dag-run context and other things like `xcom_pull` are not accessible.
* Common volumes, environment variables, and other `docker run` arguments need 
to be set on every task instead of once per executor instance. Likewise for 
other essentials like `auto_remove` that should not be task-specific.
* I have had bad experiences with the Jinja template corrupting the temporary 
Python file, but I can't remember the exact details.
* Error/exception handling is not as smooth. For example, it's not possible to 
distinguish retriable from expected exceptions (`AirflowFailException` is 
missing, and `AirflowSkipException` needs to be re-modeled as an exit code).
* Logging becomes bound to the task and not the executor.

Some of these hurdles can be mitigated with a custom wrapper around 
`@task.docker`, but it's still not as smooth as it could be. Long story short, 
it feels like you are working against the system instead of being supported by 
it. Otherwise I think the idea behind `@task.docker` is good if you need 
multiple different execution environments per task; it's just that the 
abstraction hides a bit too much, and the integration with the dag run is 
missing a few pieces.
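To illustrate the wrapper idea, here is a minimal plain-Python sketch of factoring shared `docker run` settings out of every task. The names `docker_task` and `COMMON_DOCKER_KWARGS` are hypothetical, and the inner decorator is a stand-in; a real implementation would delegate to the docker provider's `@task.docker` with the merged keyword arguments:

```python
# Hypothetical sketch: apply common docker-run settings once, letting each
# task override only what differs. The stand-in decorator just records the
# merged kwargs on the function so the merging behavior is visible.

COMMON_DOCKER_KWARGS = {
    "image": "my-registry/pipeline:latest",  # assumed image name
    "auto_remove": "success",                # shared, not repeated per task
    "environment": {"PIPELINE_ENV": "qa"},
}


def docker_task(**overrides):
    """Merge executor-wide defaults with per-task overrides."""
    merged = {**COMMON_DOCKER_KWARGS, **overrides}

    def decorator(func):
        # Real code would instead return: task.docker(**merged)(func)
        func.docker_kwargs = merged
        return func

    return decorator


@docker_task(environment={"PIPELINE_ENV": "qa", "STEP": "extract"})
def extract():
    """Task body that would run inside the container."""
    return "extracted"
```

This keeps `image`, `auto_remove`, and similar essentials in one place while still allowing a task to override, e.g., its environment; it does not, however, fix the process-isolation or xcom issues listed above.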

We've given up on `@task.docker`. To work around the use case of continuously 
updating libraries, we've built a background process that regularly updates and 
restarts the entire Docker image that runs the Airflow Celery executor, just to 
get the latest environment, and tasks then run as plain `@task`. This is far 
from ideal, as it requires every task to complete before an update can happen. 
It also makes QA deployment of a pipeline much more difficult if you want to 
dry-run a pipeline with its dependencies before merging it. 

(Same question also applies to the `EdgeExecutor`)

### Use case/motivation

_No response_

### Related issues

_No response_

### Are you willing to submit a PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)


GitHub link: https://github.com/apache/airflow/discussions/58687
