AutomationDev85 opened a new issue, #27032:
URL: https://github.com/apache/airflow/issues/27032

   ### Apache Airflow version
   
   2.4.1
   
   ### What happened
   
   We are running an Airflow deployment and hit an issue where the redis pod 
died and some tasks then got stuck in the queued state. Only after killing the 
worker pod were the tasks consumed by the worker again. I wanted to analyse 
this in more detail and saw that this behavior only occurs sometimes!
   
   To me it looks like the worker sometimes does not detect that the connection 
to the redis pod broke:
   1) If I do not see any error in the worker's log, the worker does NOT 
reconnect once redis is back!
   2) If I see the following error in the worker's log, reconnection is WORKING 
and the worker automatically re-establishes the connection:
   [2022-10-13 06:29:55,967: WARNING/MainProcess] consumer: Connection to 
broker lost. Trying to re-establish the connection...
   Traceback (most recent call last):
     File 
"/home/airflow/.local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py",
 line 332, in start
       blueprint.start(self)
     File 
"/home/airflow/.local/lib/python3.8/site-packages/celery/bootsteps.py", line 
116, in start
       step.start(parent)
     File 
"/home/airflow/.local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py",
 line 628, in start
       c.loop(*c.loop_args())
     File 
"/home/airflow/.local/lib/python3.8/site-packages/celery/worker/loops.py", line 
97, in asynloop
       next(loop)
     File 
"/home/airflow/.local/lib/python3.8/site-packages/kombu/asynchronous/hub.py", 
line 362, in create_loop
       cb(*cbargs)
     File 
"/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", 
line 1326, in on_readable
       self.cycle.on_readable(fileno)
     File 
"/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", 
line 562, in on_readable
       chan.handlers[type]()
     File 
"/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", 
line 906, in _receive
       ret.append(self._receive_one(c))
     File 
"/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", 
line 916, in _receive_one
       response = c.parse_response()
     File "/home/airflow/.local/lib/python3.8/site-packages/redis/client.py", 
line 3505, in parse_response
       response = self._execute(conn, conn.read_response)
     File "/home/airflow/.local/lib/python3.8/site-packages/redis/client.py", 
line 3479, in _execute
       return command(*args, **kwargs)
     File 
"/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 
739, in read_response
       response = self._parser.read_response()
     File 
"/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 
324, in read_response
       raw = self._buffer.readline()
     File 
"/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 
256, in readline
       self._read_from_socket()
     File 
"/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 
201, in _read_from_socket
       raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
   redis.exceptions.ConnectionError: Connection closed by server.
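
   For context, the reconnect path that celery/kombu takes in the working case 
boils down to catching the ConnectionError and retrying with a backoff. A 
minimal standalone sketch of that pattern (the `read_from_broker` callable and 
`BrokerConnectionError` class are hypothetical stand-ins, no redis dependency):

```python
import time


class BrokerConnectionError(Exception):
    """Stand-in for redis.exceptions.ConnectionError in this sketch."""


def consume_with_reconnect(read_from_broker, max_retries=5, base_delay=0.1):
    """Call read_from_broker(), retrying with exponential backoff on failure.

    This mirrors the behavior seen in the working case above: the connection
    loss is detected, logged, and the worker re-establishes the connection.
    """
    attempt = 0
    while True:
        try:
            return read_from_broker()
        except BrokerConnectionError:
            attempt += 1
            if attempt > max_retries:
                raise
            # Back off before reconnecting, analogous to
            # "consumer: Connection to broker lost. Trying to re-establish..."
            time.sleep(base_delay * 2 ** (attempt - 1))
```

   The broken case described in point 1) would correspond to the exception 
never being raised at all, so this retry loop is never entered and the worker 
hangs on a dead socket.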
   
   ### What you think should happen instead
   
   Expected behavior is that the worker reconnects to redis automatically and 
starts consuming queued tasks.
   
   ### How to reproduce
   
   1) Create a DAG with 2 tasks that run one after the other. 
   2) Start the DAG and, while the first task is executing, force-kill the 
redis pod (kubectl delete pod redis-0 -n ??? --grace-period=0 --force) to 
simulate a crashing pod.
   3) Check whether the worker reconnects automatically and executes the next 
task, or whether the task gets stuck in the queued state and the worker must be 
killed to fix this.
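
   One possible mitigation to try while reproducing (an assumption on my part, 
not a verified fix): tune the kombu/redis-py socket options so that dead 
connections are detected instead of hanging silently. In airflow.cfg this 
could look like (values here are illustrative):

```ini
[celery_broker_transport_options]
# Enable TCP keepalive so a dead broker connection is noticed
socket_keepalive = True
# Fail pending socket reads instead of blocking forever
socket_timeout = 30
retry_on_timeout = True
```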
   
   ### Operating System
   
    AKSUbuntu-1804gen2
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   Using an AKS cluster in Azure to host Airflow.
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

