AutomationDev85 opened a new issue, #27032:
URL: https://github.com/apache/airflow/issues/27032
### Apache Airflow version
2.4.1
### What happened
We are running an Airflow deployment and hit an issue where the redis pod
died and some tasks then got stuck in the queued state. Only after killing the
worker pod were the tasks consumed by the worker again. I wanted to analyse
this in more detail and saw that this behavior only occurs sometimes!
It looks to me like the worker sometimes does not detect that the connection
to the redis pod broke:
1) If I do not see any error in the worker log, the worker does NOT reconnect
once the redis pod is back!
2) If I see the following error in the worker log, it IS working and the
worker automatically reconnects:
```
[2022-10-13 06:29:55,967: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 332, in start
    blueprint.start(self)
  File "/home/airflow/.local/lib/python3.8/site-packages/celery/bootsteps.py", line 116, in start
    step.start(parent)
  File "/home/airflow/.local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 628, in start
    c.loop(*c.loop_args())
  File "/home/airflow/.local/lib/python3.8/site-packages/celery/worker/loops.py", line 97, in asynloop
    next(loop)
  File "/home/airflow/.local/lib/python3.8/site-packages/kombu/asynchronous/hub.py", line 362, in create_loop
    cb(*cbargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 1326, in on_readable
    self.cycle.on_readable(fileno)
  File "/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 562, in on_readable
    chan.handlers[type]()
  File "/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 906, in _receive
    ret.append(self._receive_one(c))
  File "/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 916, in _receive_one
    response = c.parse_response()
  File "/home/airflow/.local/lib/python3.8/site-packages/redis/client.py", line 3505, in parse_response
    response = self._execute(conn, conn.read_response)
  File "/home/airflow/.local/lib/python3.8/site-packages/redis/client.py", line 3479, in _execute
    return command(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 739, in read_response
    response = self._parser.read_response()
  File "/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 324, in read_response
    raw = self._buffer.readline()
  File "/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 256, in readline
    self._read_from_socket()
  File "/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 201, in _read_from_socket
    raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.
```
### What you think should happen instead
The expected behavior is that the worker reconnects to redis automatically and
starts consuming queued tasks.
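For illustration, the expected behavior can be sketched as a generic reconnect loop. This is library-free pseudocode of what the worker should do, not Celery/kombu source; `connect`, `max_retries`, and `base_delay` are illustrative names:

```python
import time


def consume_with_reconnect(connect, max_retries=5, base_delay=0.0):
    """Sketch of the expected worker behavior: when reading from the
    broker raises ConnectionError (as in the traceback above), retry
    the connection with backoff instead of leaving tasks queued.
    Illustrative only -- this is not Celery/kombu code."""
    for attempt in range(max_retries):
        try:
            # A real worker would resume its consume loop once connected.
            return connect()
        except ConnectionError:
            # Back off before the next attempt instead of giving up silently.
            time.sleep(base_delay * (2 ** attempt))
    raise ConnectionError("broker still unreachable after retries")
```

In case 1) from above, the worker behaves as if this loop silently exits after the first failure; in case 2), it behaves as the loop intends.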
### How to reproduce
1) Create a DAG with 2 tasks that run one after the other.
2) Start the DAG and, while the first task is executing, force-kill the
redis pod (`kubectl delete pod redis-0 -n ??? --grace-period=0 --force`) to
simulate a crashing pod.
3) Check whether the worker reconnects automatically and executes the next
task, or whether the task gets stuck in the queued state and the worker must
be killed to fix it.
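A possible mitigation to test alongside the reproduction above (an assumption on my side, not something I have verified for this bug): the Celery redis transport accepts socket keepalive/health-check options, which in Airflow can be set via the `[celery_broker_transport_options]` section of `airflow.cfg`, e.g.:

```
# Hedged sketch -- option names come from kombu's redis transport and
# redis-py and may need adjusting for the versions in your deployment.
[celery_broker_transport_options]
# Enable TCP keepalive on the broker sockets.
socket_keepalive = True
# Periodically ping redis so dead connections are noticed sooner.
health_check_interval = 30
```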
### Operating System
AKSUbuntu-1804gen2
### Versions of Apache Airflow Providers
_No response_
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
Using an AKS cluster in Azure to host Airflow.
### Anything else
_No response_
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)