potiuk commented on issue #24731:
URL: https://github.com/apache/airflow/issues/24731#issuecomment-1322552034

   BTW, I believe there is something very wrong with your restart scenario 
and configuration in general - some mistakes or misunderstandings of how the 
image entrypoint works.
   
   ```
   ERROR: Pidfile (/opt/airflow/airflow-worker.pid) already exists.
   Seems we're already running? (pid: 1)
   ```
   
   I think there are a few things you are doing wrong here, and they compound:
   
   1) It seems that you run airflow as the init process (PID 1) in your 
container. This is possible, but you need to realise the consequences it has 
for signal propagation and do it properly. There are many traps you can fall 
into when doing it wrongly, so I recommend reading why the airflow image uses 
dumb-init as the init process and what consequences that has (especially for 
celery): 
https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation 
   
   The .pid file will only contain '1' if your process is started as the 
"init" process - which also means the container dies when your process dies. 
When you use dumb-init, as we do by default in our image, dumb-init has 
process id 1; in your setup, however, your airflow process will always have 
process id 1, and that is the original root cause of the problem you have.
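   For illustration, here is a simplified sketch of what the official image's 
entrypoint arrangement looks like (the real apache/airflow image does more - 
see the entrypoint docs linked above - and already ships this, so you should 
not override it; the tag below is just an example):
   
   ```dockerfile
   # Simplified sketch: dumb-init becomes PID 1 and forwards signals to the
   # airflow process, which then runs with a PID other than 1.
   FROM apache/airflow:2.4.3          # example tag - pick your own

   ENTRYPOINT ["/usr/bin/dumb-init", "--"]
   CMD ["airflow", "celery", "worker"]
   ```
   
   If your custom image replaces this entrypoint with a bare airflow command, 
your airflow process itself becomes PID 1, which is exactly the situation 
described above.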
   
   2) The second problem is that you most likely write the .pid file to a 
shared volume, which makes it survive after the container is killed. This is 
very, very wrong. If you rely on restarting the container and your process 
runs with PID = 1, you should never save the .pid file in a shared volume 
that outlives the container, because you will get exactly the problem you 
have: your airflow process always starts as the init process with PID = 1, so 
even after the old process has been killed, the restarted container runs it 
with PID 1 again, and airflow ends up checking the .pid file created by the 
previous "1" process against itself (which also runs with PID = 1). It will 
never start.
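   The failure mode can be sketched in a few lines of shell (this is 
illustrative only, not Airflow's actual pid-file code): a stale pid file 
containing "1" survives the old container, and the existence check on restart 
always succeeds, because PID 1 always exists:
   
   ```shell
   #!/bin/sh
   # Illustrative sketch of the stale-pidfile failure mode.
   PID_FILE=$(mktemp)            # stands in for /opt/airflow/airflow-worker.pid
   echo 1 > "$PID_FILE"          # leftover written by the previous container run

   STALE_PID=$(cat "$PID_FILE")
   if [ -e "/proc/$STALE_PID" ]; then
       # This branch is taken on every restart: PID 1 is always alive -
       # it is the new airflow process itself.
       echo "ERROR: Pidfile ($PID_FILE) already exists. Seems we're already running? (pid: $STALE_PID)"
   fi
   rm -f "$PID_FILE"
   ```
   
   With the pid file on the ephemeral container filesystem instead, it 
disappears together with the killed container and the check never misfires.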
   
   
   Storing the .pid file in a shared volume is very much against the 
container philosophy. In general, if you restart whole containers rather than 
processes, the .pid file should NEVER be stored in a shared volume - it 
should always be stored in the ephemeral container filesystem, so that it 
gets deleted automatically when the whole container is killed. Make sure you 
do not keep the .pid file in a shared volume - especially if you run your 
airflow command as the entrypoint.
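   A hypothetical docker-compose fragment showing the idea (the service name, 
paths and image tag are examples, not your actual setup; `--pid` is the 
standard option for pointing the pid file at a non-shared location):
   
   ```yaml
   # Mount only what must be shared (dags, logs) - never the whole
   # /opt/airflow directory - so the .pid file stays in the container's
   # own ephemeral filesystem and vanishes with the container.
   services:
     worker:
       image: apache/airflow:2.4.3          # example tag - pick your own
       command: airflow celery worker --pid /tmp/airflow-worker.pid
       volumes:
         - ./dags:/opt/airflow/dags
         - ./logs:/opt/airflow/logs
         # NOT: - airflow-home:/opt/airflow   <- would persist the .pid file
   ```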
   
   So I think you should really rethink the way the entrypoint works in your 
images, the way the .pid files get created and stored, and the way the 
restart process for a failed container works - it seems all three are 
custom-done by you, and they compound into the problem you experience. When 
you use the docker-compose approach, you need to realise how all this works, 
how those elements interact, and how to make it production-robust.
   
   It seems that you have chosen a pretty hard path to walk; going the beaten 
Helm + Kubernetes path, without diverging too much from the approach we 
propose, would have solved most of it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
