potiuk commented on issue #28740: URL: https://github.com/apache/airflow/issues/28740#issuecomment-1373549644
> Thanks for your reply @potiuk.
>
> We will upgrade EKS version to 1.22 or 1.23 and get back to you on this.

In the meantime I found out that you might also need to upgrade Airflow (and effectively Celery), because this is a known bug in the Celery liveness check, not in Airflow and not in Kubernetes. Here are the relevant issue and PR: https://github.com/apache/airflow/issues/21026 and https://github.com/apache/airflow/pull/19703. If you have Celery before 5.2.3 you might experience it too, so you might also check your Celery version.

> Regarding scheduler and Trigger pod memory leak. we yet to find the root cause.

I think this one might not be a memory leak, but increased cache memory usage because you do not rotate logs. This is not a memory leak per se. You might simply be looking at the wrong memory metric. (And actually that might also be a reason for the first "memory leak" in Celery as well, but I would rather try the k8s upgrade first anyway.)

You have not explained which memory you observe, but what you might see is the usual behaviour of the Linux kernel, which keeps the contents of files in its cache after they are written, and if the files are not rotated/moved, the cache is never freed. This is standard behaviour of Linux (generally Linux in such a case will fill all available memory with cache to speed up file access). If this is the case, then it is generally harmless (the memory is freed immediately when needed by something else and you will never get "out-of-memory" errors because of it). However, we implemented a way to hint the kernel not to do it when we are writing logs, and it was released in Airflow 2.4.3 (https://github.com/apache/airflow/pull/27223), so upgrading to Airflow 2.4.3+ will solve it, if this is the case.

In general, I'd STRONGLY recommend that you migrate to the latest released Airflow (2.5.0 currently). There are multiple hundreds of issues fixed, and you will save yourself and the people who want to help you from chasing problems that have long been handled.
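To illustrate what "hinting the kernel" can look like, here is a minimal sketch of the general technique using `os.posix_fadvise`. It is not necessarily what the PR above does: the names `hint_no_cache` and `NoCacheFileHandler` are hypothetical, and the choice of `POSIX_FADV_NOREUSE` is an assumption made for this example. How much effect the hint has depends on the kernel.

```python
import logging
import os


def hint_no_cache(stream):
    """Ask the kernel not to keep this file's pages in the page cache.

    Hypothetical helper. Returns True if the hint was issued,
    False if it is unsupported on this platform or for this stream.
    """
    if not hasattr(os, "posix_fadvise"):
        return False  # posix_fadvise is only available on some Unix systems
    try:
        fd = stream.fileno()
        # offset=0 and len=0 apply the advice to the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_NOREUSE)
    except (OSError, ValueError, AttributeError):
        # Stream has no real file descriptor, or the advice is unavailable.
        return False
    return True


class NoCacheFileHandler(logging.FileHandler):
    """Hypothetical FileHandler that issues the hint whenever it opens its file."""

    def _open(self):
        stream = super()._open()
        hint_no_cache(stream)  # best effort; logging works either way
        return stream
```

Such a handler would be attached like any other `logging.FileHandler`, for example `logging.getLogger("my_logger").addHandler(NoCacheFileHandler("/tmp/example.log"))` (logger name and path are placeholders). The hint is advisory only, so the log contents are the same with or without it.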
I assume the problem is fixed by the combination of those upgrades. Closing until the users upgrade to those versions and report back in case the problem is not fixed after applying them.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
