potiuk commented on code in PR #58365:
URL: https://github.com/apache/airflow/pull/58365#discussion_r2545201517


##########
airflow-core/src/airflow/utils/gc_utils.py:
##########
@@ -0,0 +1,44 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+from __future__ import annotations
+
+import gc
+from functools import wraps
+
+
+def with_gc_freeze(func):
+    """
+    Freeze the GC before executing the function and unfreeze it after execution.
+
+    This is done to prevent memory increase due to COW (Copy-on-Write) by moving all
+    existing objects to the permanent generation before forking the process. After the
+    function executes, unfreeze is called to ensure there is no impact on gc operations
+    in the original running process.

Review Comment:
   Yes. I think in our case it's mostly a possible optimisation, not a breakthrough. But maybe worth trying.
   
   I think it would be worth seeing the difference, but I quite agree that in our case we would probably have to run `gc.disable()` somewhere **super early** in the imports of airflow. Unfortunately, a **lot** is happening when you run `import airflow` (still - even after we split and isolate pieces of airflow), so a lot of objects are already created by the time we get to "real airflow code". But we can try it and see what effect it has.
   
   Since we are starting most of the processes relatively early, I guess the memory is not that fragmented by then and gc has not really left a lot of holes - gc only really kicks in when there are garbage-collectable cycles. And I would say we won't be able to reason about it, because there are too many variables. And you are right @wjddn279 - it won't help for the restarted executor forks.
   
   I think running an experiment to see the effects will not hurt.
   
   If we find that it has further significant gains, we could do something even deeper - disable gc at the very beginning, fork the scheduler into an "executor supervisor" process, leave that one with gc disabled, and re-enable gc in the scheduler quickly after that initial fork. The supervisor would do **nothing** except watch for executor forks and restart them. I think it would be worth trying this scenario as well.
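   A hedged sketch of that fork pattern, using only stdlib `gc`/`os` calls - the `supervisor_main` name is a hypothetical stand-in for the supervisor loop, not anything that exists in Airflow:

   ```python
   import gc
   import os


   def fork_supervisor_with_gc_disabled(supervisor_main):
       """Fork a child that runs with gc permanently disabled.

       The parent (scheduler) re-enables gc right after the fork, so only
       the child keeps it off. Returns the child's pid to the parent.
       """
       gc.disable()   # no automatic collections from here on
       gc.freeze()    # make all current objects CoW-friendly for the fork
       pid = os.fork()
       if pid == 0:
           # Child: the supervisor does almost nothing, so gc stays off.
           try:
               supervisor_main()
           finally:
               os._exit(0)
       # Parent: restore normal gc behaviour for the scheduler.
       gc.unfreeze()
       gc.enable()
       return pid
   ```

   The point of the ordering is that the child inherits the frozen, gc-disabled state, while the parent goes back to normal immediately - so the supervisor's pages stay shared with whatever it later forks.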
   
   But this would likely have some other side effects and require a lot more code changes, so it could be the 2nd and 3rd step of optimisation that we do as a follow-up. We definitely do not have to hit the "optimal" path; I think following Pareto's principle - 80% of the gains for 20% of the work - might be a good idea.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to