sstoefe opened a new issue #13675:
URL: https://github.com/apache/airflow/issues/13675
**Apache Airflow version**: v2.0.0
**Git Version**: release:2.0.0+ab5f770bfcd8c690cbe4d0825896325aca0beeca
**Docker version**: Docker version 20.10.1, build 831ebeae96
**Environment**:
- **Cloud provider or hardware configuration**: local setup, docker engine
in swarm mode, docker stack deploy
- **OS** (e.g. from /etc/os-release): Manjaro Linux
- **Kernel** (e.g. `uname -a`): 5.9.11
- **Install tools**:
- docker airflow image apache/airflow:2.0.0-python3.8 (hash _fe4a64af9553_)
- **Others**:
**What happened**:
When using `DockerSwarmOperator` (from either the `contrib` or the `providers` module) together with the default `enable_logging=True` option, tasks never succeed and stay in the `running` state. When checking the `docker service logs` I can clearly see that the container ran and ended successfully. Airflow, however, does not recognize that the container finished and keeps the tasks in the `running` state.
However, when using `enable_logging=False` AND `auto_remove=False`, containers are recognized as finished and tasks correctly end up in the `success` state. When using `enable_logging=False` and `auto_remove=True`, I get the following error message:
```
{taskinstance.py:1396} ERROR - 404 Client Error: Not Found ("service 936om1s4zso10ye5ferhvwnxn not found")
```
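My assumption (I have not traced this through the operator code) is that the operator still queries the service after `auto_remove` has already deleted it; with the docker Python SDK, asking for a removed service produces exactly this kind of 404. A minimal sketch, with the service ID taken from the error above:
```
# Sketch only, not the operator's actual code: querying a service that has
# already been removed raises NotFound ("404 Client Error ... not found").
import docker
from docker.errors import NotFound

client = docker.from_env()  # uses unix://var/run/docker.sock by default

try:
    client.services.get("936om1s4zso10ye5ferhvwnxn")
except NotFound as exc:
    print(f"service already removed: {exc}")
```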
**What you expected to happen**:
When I run a DAG containing `DockerSwarmOperator` tasks, I expect the Docker containers to be distributed to the Docker swarm and the container logs and states to be tracked correctly by the `DockerSwarmOperator`. That is, with the `enable_logging=True` option I would expect the TaskInstance's log to contain the logging output of the Docker container/service. Furthermore, with the `auto_remove=True` option I would expect Docker services to be removed after the TaskInstance has finished successfully.
It looks like something is broken with the `enable_logging=True` and `auto_remove=True` options.
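For reference, this is roughly the manual check I do against the Docker API to confirm that the service really finished while the Airflow task is still shown as `running` (a sketch using the docker Python SDK; the service name is illustrative):
```
# Sketch: manually confirming that a swarm service created by the operator
# has finished. "airflow-xyz" stands in for the real service name.
import docker

client = docker.from_env()
service = client.services.get("airflow-xyz")

# Each swarm task reports its state under Status.State
# ("running", "complete", "failed", ...).
print([task["Status"]["State"] for task in service.tasks()])
# prints e.g. ['complete'] once the container has exited successfully
```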
**How to reproduce it**:
#### **`Dockerfile`**
```
FROM apache/airflow:2.0.0-python3.8
ARG DOCKER_GROUP_ID
USER root
RUN groupadd --gid $DOCKER_GROUP_ID docker \
    && usermod -aG docker airflow
USER airflow
```
The airflow user needs to be in the docker group so that it has access to the Docker daemon.
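A quick way to confirm that the `airflow` user can actually reach the daemon through the mounted socket is to run something like this inside the container (a sketch; the `docker` Python package should already be present since the Docker provider depends on it):
```
# Sanity check, run as the airflow user inside the container: verifies that
# the mounted /var/run/docker.sock is reachable.
import docker

client = docker.DockerClient(base_url="unix://var/run/docker.sock")
print(client.ping())     # True if the daemon answers
print(client.version())  # daemon / API version details
```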
#### **build the Dockerfile**
```
docker build --build-arg DOCKER_GROUP_ID=$(getent group docker | awk -F: '{print $3}') -t docker-swarm-bug .
```
#### **`docker-stack.yml`**
```
version: "3.2"
networks:
airflow:
services:
postgres:
image: postgres:13.1
environment:
- POSTGRES_USER=airflow
- POSTGRES_DB=airflow
- POSTGRES_PASSWORD=airflow
- PGDATA=/var/lib/postgresql/data/pgdata
ports:
- 5432:5432
volumes:
- /var/run/docker.sock:/var/run/docker.sock
- ./database/data:/var/lib/postgresql/data/pgdata
- ./database/logs:/var/lib/postgresql/data/log
command: >
postgres
-c listen_addresses=*
-c logging_collector=on
-c log_destination=stderr
-c max_connections=200
networks:
- airflow
redis:
image: redis:5.0.5
environment:
REDIS_HOST: redis
REDIS_PORT: 6379
ports:
- 6379:6379
networks:
- airflow
webserver:
env_file:
- .env
image: docker-swarm-bug:latest
ports:
- 8080:8080
volumes:
- ./airflow_files/dags:/opt/airflow/dags
- ./logs:/opt/airflow/logs
- ./files:/opt/airflow/files
- /var/run/docker.sock:/var/run/docker.sock
deploy:
restart_policy:
condition: on-failure
delay: 8s
max_attempts: 3
depends_on:
- postgres
- redis
command: webserver
healthcheck:
test: ["CMD-SHELL", "[ -f /opt/airflow/airflow-webserver.pid ]"]
interval: 30s
timeout: 30s
retries: 3
networks:
- airflow
flower:
image: docker-swarm-bug:latest
env_file:
- .env
ports:
- 5555:5555
depends_on:
- redis
deploy:
restart_policy:
condition: on-failure
delay: 8s
max_attempts: 3
volumes:
- ./logs:/opt/airflow/logs
command: celery flower
networks:
- airflow
scheduler:
image: docker-swarm-bug:latest
env_file:
- .env
volumes:
- ./airflow_files/dags:/opt/airflow/dags
- ./logs:/opt/airflow/logs
- ./files:/opt/airflow/files
- /var/run/docker.sock:/var/run/docker.sock
command: scheduler
deploy:
restart_policy:
condition: on-failure
delay: 8s
max_attempts: 3
networks:
- airflow
worker:
image: docker-swarm-bug:latest
env_file:
- .env
volumes:
- ./airflow_files/dags:/opt/airflow/dags
- ./logs:/opt/airflow/logs
- ./files:/opt/airflow/files
- /var/run/docker.sock:/var/run/docker.sock
command: celery worker
depends_on:
- scheduler
deploy:
restart_policy:
condition: on-failure
delay: 8s
max_attempts: 3
networks:
- airflow
initdb:
image: docker-swarm-bug:latest
env_file:
- .env
volumes:
- ./airflow_files/dags:/opt/airflow/dags
- ./logs:/opt/airflow/logs
- ./files:/opt/airflow/files
- /var/run/docker.sock:/var/run/docker.sock
entrypoint: /bin/bash
deploy:
restart_policy:
condition: on-failure
delay: 8s
max_attempts: 5
command: -c "airflow db init && airflow users create --firstname admin
--lastname admin --email admin --password admin --username admin --role Admin"
depends_on:
- redis
- postgres
networks:
- airflow
```
#### **`docker_swarm_bug.py`**
```
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.providers.docker.operators.docker_swarm import DockerSwarmOperator

# you can also try DockerSwarmOperator from the contrib module, shouldn't make a difference
# from airflow.contrib.operators.docker_swarm_operator import DockerSwarmOperator

default_args = {
    "owner": "airflow",
    "start_date": "2021-01-14",
}

with DAG(
    "docker_swarm_bug", default_args=default_args, schedule_interval="@once"
) as dag:
    start_op = BashOperator(
        task_id="start_op",
        bash_command="echo start testing multiple dockers",
    )

    docker_swarm = list()
    for i in range(16):
        docker_swarm.append(
            DockerSwarmOperator(
                task_id=f"docker_swarm_{i}",
                image="hello-world:latest",
                force_pull=True,
                auto_remove=True,
                api_version="auto",
                docker_url="unix://var/run/docker.sock",
                network_mode="bridge",
                enable_logging=False,
            )
        )

    finish_op = BashOperator(
        task_id="finish_op",
        bash_command="echo finish testing multiple dockers",
    )

    start_op >> docker_swarm >> finish_op
```
#### **create directories, copy DAG and set permissions**
```
mkdir -p airflow_files/dags
cp docker_swarm_bug.py airflow_files/dags/
mkdir logs
mkdir files
sudo chown -R 50000 airflow_files logs files
```
UID 50000 is the ID of the airflow user inside the Docker images.
#### **deploy `docker-stack.yml`**
```
docker stack deploy --compose-file docker-stack.yml airflow
```
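Before triggering the DAG I check that all services of the stack came up, either with `docker stack services airflow` or, equivalently, with a small sketch like this (the `airflow_` prefix comes from the stack name used above):
```
# Sketch: list the services of the deployed stack.
# docker stack deploy prefixes service names with "<stack name>_".
import docker

client = docker.from_env()
for service in client.services.list():
    if service.name.startswith("airflow_"):
        print(service.name)
```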
#### **trigger DAG `docker_swarm_bug` in UI**
**Anything else we need to know**:
The problem occurs with the option `enable_logging=True` (and, with logging disabled, also with `auto_remove=True`).
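Since the hang only shows up with `enable_logging=True`, my assumption (again, not verified against the operator source) is that the log streaming never terminates. With the docker SDK, following service logs looks roughly like the sketch below, and with `follow=True` the generator keeps blocking until the stream is closed:
```
# Sketch: how following swarm service logs with the docker SDK behaves.
# With follow=True this loop blocks until the log stream is closed, which is
# my suspicion for why the task never leaves the "running" state.
import docker

client = docker.from_env()
service = client.services.get("airflow-xyz")  # illustrative service name

for line in service.logs(follow=True, stdout=True, stderr=True):
    print(line.decode(errors="replace"), end="")
```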