mattvonrocketstein commented on issue #13542:
URL: https://github.com/apache/airflow/issues/13542#issuecomment-1138828023

   Question for other MWAA users: have you tried setting 
max-workers == min-workers, effectively disabling autoscaling? Is anyone 
_without_ autoscaling actually seeing these failures, regardless of Airflow 
version?
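
   For anyone who wants to try it via the API, here's a minimal sketch using 
boto3's MWAA client; the environment name and worker count are placeholders, 
not values from this thread:

   ```python
   # Hedged sketch: pin the MWAA worker fleet to a fixed size so it never
   # scales up or down. "my-environment" and the count of 4 are placeholders.
   import boto3

   mwaa = boto3.client("mwaa")

   mwaa.update_environment(
       Name="my-environment",  # placeholder environment name
       MinWorkers=4,           # min == max effectively disables autoscaling
       MaxWorkers=4,
   )
   ```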
   
   We've also talked to the MWAA team, and haven't heard clear answers about 
whether messages/workers are properly drained when down-scaling, so I'm 
wondering if that's the crux of this issue: queue state becomes inconsistent 
due to race conditions from improper worker shutdown.  Since the MWAA backend 
is pretty opaque to end-users, it's possible that downscaling is nothing more 
careful than simply terminating an EC2 worker, Fargate pod, or whatever.  
That said, I don't know much about Airflow/Celery internals as far as 
redelivery, dead-letter queues, etc., so I might be way off base here.  
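
   For reference, here's my (hedged) understanding of the Celery settings 
that govern what happens to in-flight messages when a worker dies; whether 
MWAA's managed workers set anything like this is exactly the opaque part, so 
treat this as a sketch, not their actual config:

   ```python
   # Sketch of the Celery redelivery knobs, assuming an SQS broker like MWAA's.
   from celery import Celery

   app = Celery("example_app", broker="sqs://")  # placeholder broker URL

   app.conf.update(
       task_acks_late=True,              # ack only after the task finishes, so a
                                         # killed worker's message gets redelivered
       task_reject_on_worker_lost=True,  # requeue tasks whose worker process died
       worker_prefetch_multiplier=1,     # don't prefetch messages a dying worker
                                         # would strand until redelivery kicks in
   )

   # A properly "drained" downscale would be a warm shutdown (SIGTERM): the
   # worker stops consuming, finishes in-flight tasks, then exits. Simply
   # terminating the EC2 instance / Fargate pod skips all of that.
   ```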
   
   Since this is something that arguably could/should be fixed in a few 
different places (the MWAA core infrastructure, the Celery codebase, or the 
Airflow codebase), it seems likely the problem will stick around for a while, 
along with the confusion about which versions are affected.  The utility 
DAGs in this thread are an awesome reference ❤️ , and it may come to that, but 
I'm still hoping for a different work-around.  Airflow version upgrades would 
also leave us with a big stack of things to migrate, and we can't jump into 
that immediately.  Without autoscaling we can expect things to get more 
expensive, but we're thinking it may be worth it at this point to buy more 
stability.  Anyone got more info?

