potiuk commented on issue #28740:
URL: https://github.com/apache/airflow/issues/28740#issuecomment-1373549644

   > Thank for your reply @potiuk.
   > 
   > We will upgrade EKS version to 1.22 or 1.23 and get back to you on this.
   > 
   
   In the meantime I found out that you might also need to upgrade Airflow 
(and with it Celery), because this is a known bug in the Celery liveness 
check, not in Airflow and not in Kubernetes. The relevant issue and PR are 
https://github.com/apache/airflow/issues/21026 and 
https://github.com/apache/airflow/pull/19703. If you run Celery older than 
5.2.3 you might hit it too, so please also check your Celery version.
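
   A quick way to check this from inside the worker image could look like the 
sketch below. It only uses the standard library; the helper names 
(`parse_release`, `predates_fix`) are made up for illustration, and 
pre-release suffixes such as `rc1` are simply ignored.

```python
from importlib.metadata import PackageNotFoundError, version

FIXED_IN = (5, 2, 3)  # the Celery version named in this comment


def parse_release(text):
    """Turn '5.2.3' or '5.2.3rc1' into a 3-tuple of ints, ignoring suffixes."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def predates_fix(installed):
    """True when the installed version is older than FIXED_IN."""
    return parse_release(installed) < FIXED_IN


def main():
    try:
        installed = version("celery")
    except PackageNotFoundError:
        print("celery is not installed in this environment")
        return
    verdict = "predates 5.2.3" if predates_fix(installed) else "is 5.2.3 or newer"
    print(f"celery {installed} {verdict}")


if __name__ == "__main__":
    main()
```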
   
   > Regarding scheduler and Trigger pod memory leak. we yet to find the root 
cause.
   
   I think this one might not be a memory leak, but rather increased cache 
memory usage because you do not rotate logs. 
   
   This is not a memory leak per se. You might simply be looking at the wrong 
memory metric (and that might actually also explain the first "memory leak" in 
Celery, but I would still try the k8s upgrade first).
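
   One way to tell the two apart is to split the container's cgroup memory 
counters into file cache and process memory. The sketch below parses the text 
of a `memory.stat` file. The function name is made up, and where the file 
lives is an assumption: typically `/sys/fs/cgroup/memory.stat` on cgroup v2 
and `/sys/fs/cgroup/memory/memory.stat` on cgroup v1.

```python
def split_memory_stat(stat_text):
    """Return (cache_bytes, process_bytes) from the text of a cgroup memory.stat file."""
    values = {}
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1].isdigit():
            values[fields[0]] = int(fields[1])
    # cgroup v2 names the counters "file"/"anon"; cgroup v1 uses "cache"/"rss".
    cache = values.get("file", values.get("cache", 0))
    process = values.get("anon", values.get("rss", 0))
    return cache, process


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the reporter's cluster.
    sample = "anon 1048576\nfile 8388608\n"
    cache, process = split_memory_stat(sample)
    print(f"file cache: {cache} bytes, process memory: {process} bytes")
```

   If the "file"/"cache" number is the one that keeps growing while the 
"anon"/"rss" number stays flat, you are most likely looking at cache, not at a 
leak.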
   
   You have not explained which memory you observe, but what you might see is 
the usual behaviour of the Linux kernel: it keeps the contents of files in the 
page cache after they are written, and if the files are not rotated/moved, that 
cache is not freed on its own. This is standard Linux behaviour (in such a case 
Linux will generally fill all available memory with cache to speed up file 
access). If this is the case, it is generally harmless: the memory is freed 
immediately when something else needs it, and you will never get 
"out-of-memory" errors because of it. However, we implemented a way to hint the 
kernel not to do this when we are writing logs. It was released in Airflow 
2.4.3 (https://github.com/apache/airflow/pull/27223), so upgrading to Airflow 
2.4.3+ will solve it, if this is the case.
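
   For illustration, this kind of kernel hint can be given from Python with 
`os.posix_fadvise`. The sketch below shows the general technique only and is 
not a copy of the code in the PR; the function name is made up, and the call 
is skipped on platforms that do not provide `posix_fadvise`.

```python
import os


def append_without_caching(path, data):
    """Append bytes to path, then ask the kernel to drop the file's cached pages."""
    with open(path, "ab") as handle:
        handle.write(data)
        handle.flush()
        # Dirty pages cannot be dropped, so push them to disk first.
        os.fsync(handle.fileno())
        if hasattr(os, "posix_fadvise"):
            # Offset 0 with length 0 means "the whole file".
            os.posix_fadvise(handle.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
```

   The hint is advisory: the kernel is free to ignore it, and the file 
contents are not affected either way.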
   
   In general, I'd STRONGLY recommend that you migrate to the latest released 
Airflow (2.5.0 currently). Several hundred issues have been fixed since, and 
you will save yourself and the people who want to help you from chasing 
problems that were handled long ago. 
   
   I assume the problem is fixed by the combination of those upgrades. Closing 
until the users upgrade to those versions and report back in case the problem 
is not fixed after applying them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
