mattvonrocketstein commented on issue #13542: URL: https://github.com/apache/airflow/issues/13542#issuecomment-1138828023
Question for other MWAA users: have you tried setting max-workers == min-workers, effectively disabling autoscaling (rough sketch below)? Is anyone _without_ autoscaling actually seeing this stuff, regardless of Airflow version?

We've also talked to the MWAA team and haven't heard a clear answer about whether messages and workers are properly drained when down-scaling, so I'm wondering whether that's the crux of this issue: queue state becoming inconsistent due to race conditions with improper worker shutdown. Since the MWAA backend is pretty opaque to end users, it's possible that downscaling is nothing more complicated or careful than just terminating an EC2 worker, Fargate pod, or whatever. That said, I don't know much about Airflow/Celery internals as far as redelivery, dead-letter queues, etc., so I might be way off base here.

Since this is something that arguably could/should be fixed in a few different places (the MWAA core infrastructure, the Celery codebase, or the Airflow codebase), it seems likely that both the problem and the confusion about which versions are affected will stick around for a while.

The utility DAGs in this thread are an awesome reference ❤️, and it may come to that, but I'm still hoping for a different workaround. Airflow version upgrades or the like would also leave us with a big stack of things to migrate, and we can't jump into that immediately. Without autoscaling we can expect things to get more expensive, but we're thinking it may be worth it at this point to buy more stability. Anyone got more info?
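For anyone who wants to try the max == min experiment, here's a minimal sketch of pinning the worker count via boto3. The environment name, region, and worker count are placeholders, and this assumes your IAM role can call `mwaa:UpdateEnvironment`; adjust for your setup:

```python
# Sketch: pin MWAA min/max workers to the same value, so the fleet
# never scales (and so never goes through the suspect downscale path).
import boto3

# region and environment name are placeholders for this example
mwaa = boto3.client("mwaa", region_name="us-east-1")

resp = mwaa.update_environment(
    Name="my-mwaa-environment",  # hypothetical environment name
    MinWorkers=5,                # same value for both bounds...
    MaxWorkers=5,                # ...so no scale-down ever happens
)
print(resp["Arn"])  # UpdateEnvironment returns the environment ARN
```

Note this kicks off an environment update, which takes a while to apply, and a pinned fleet is billed at the full worker count around the clock, which is the cost trade-off mentioned above.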
