hterik opened a new issue, #58670:
URL: https://github.com/apache/airflow/issues/58670

   ### Description
   
   When writing data pipelines, the DAG code very often depends heavily on Python 
libraries and other tools installed in the environment. Today, when running Airflow 
locally with the `CeleryExecutor`, it continuously fetches and reparses the DAG code, 
but it will not reimport any other libraries that the DAG depends on. The DAG code 
essentially runs in whatever environment the executor was first launched in. 
   
   It would be very helpful if Airflow could automatically fetch a Docker image 
and run each task instance inside a fresh container created from it. This is how the 
`KubernetesExecutor` works today, and it greatly simplifies deployment of updated 
libraries and providers. 
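   For comparison, a minimal sketch of how per-task image selection already looks 
with the `KubernetesExecutor` (the image name here is just a placeholder, not 
something from our setup):

```python
from airflow.decorators import task
from kubernetes.client import models as k8s


@task(
    executor_config={
        "pod_override": k8s.V1Pod(
            spec=k8s.V1PodSpec(
                containers=[
                    # "base" is the container name the KubernetesExecutor expects
                    # to override; the image tag is a placeholder.
                    k8s.V1Container(name="base", image="my-registry/pipeline:latest"),
                ]
            )
        )
    },
)
def my_task():
    ...
```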
   
   I'm aware of `@task.docker`; while it sort of works, it has many rough edges 
that make it a less seamless experience than the `KubernetesExecutor`:
   * Imports made outside of the `@task` function are not available inside the 
task; the whole task body essentially runs as a separate process. 
   * The DAG run context and other features like `xcom_pull` are not accessible.
   * Common volumes, environment variables, and other `docker run` arguments need 
to be set on every task instead of once per executor instance. Likewise for other 
essentials like `auto_remove` that should not be task-specific.
   * I have had bad experiences with the Jinja template corrupting the temporary 
Python file, though I can't remember the exact details.
   * Error/exception handling is not as smooth. For example, it is not possible to 
distinguish retriable from expected exceptions (`AirflowFailException` is missing, 
and `AirflowSkipException` has to be re-modeled as an exit code).
   * Logging becomes bound to the task and not the executor.
   
   Some of the hurdles can be mitigated with a custom wrapper around 
`@task.docker` (sketched below), but it's still not as smooth as it could be. Long 
story short, it feels like you are working against the system instead of being 
supported by it. Otherwise, I think the idea behind `@task.docker` is good when you 
need multiple different execution environments per task; it's just that the 
abstraction hides a bit too much and the integration with the DAG run is missing a 
few pieces.
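   For illustration, a minimal sketch of the kind of wrapper meant above, assuming 
a shared dict of container settings (the image name, mount paths, and `auto_remove` 
value are placeholders, and the exact accepted kwargs depend on the Docker provider 
version in use):

```python
from airflow.decorators import task
from docker.types import Mount

# Shared container settings that ideally would live on the executor instead of
# being repeated on every task. Image and mount paths are placeholders.
COMMON_DOCKER_KWARGS = dict(
    image="my-registry/pipeline:latest",
    auto_remove="success",
    environment={"DATA_ROOT": "/data"},
    mounts=[Mount(source="/srv/data", target="/data", type="bind")],
)


@task.docker(**COMMON_DOCKER_KWARGS)
def transform():
    # Module-level imports from the DAG file are not visible here: the task body
    # runs as a separate process inside the container, so everything it needs
    # must be imported within the function.
    import pandas as pd

    return int(pd.Timestamp.utcnow().timestamp())
```

   Even with such a wrapper, the context/XCom access, exception mapping, and logging 
issues listed above remain.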
   
   We've given up on `@task.docker`. To work around the use case of continuously 
updating libraries, we've built a background process that regularly updates and 
restarts the entire Docker image running the Airflow Celery executor, just to get 
the latest environment, and tasks are then run as plain `@task`. This is far from 
ideal, as it requires every task to complete before the environment can be updated. 
It also makes QA deployment of a pipeline much more difficult if you want to 
dry-run a pipeline with its dependencies before merging it. 
   
   (The same question also applies to the `EdgeExecutor`.)
   
   ### Use case/motivation
   
   _No response_
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

