wjddn279 commented on code in PR #58365:
URL: https://github.com/apache/airflow/pull/58365#discussion_r2545090946


##########
airflow-core/src/airflow/utils/gc_utils.py:
##########
@@ -0,0 +1,44 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+from __future__ import annotations
+
+import gc
+from functools import wraps
+
+
+def with_gc_freeze(func):
+    """
+    Freeze the GC before executing the function and unfreeze it after 
execution.
+
+    This is done to prevent memory increase due to COW (Copy-on-Write) by 
moving all
+    existing objects to the permanent generation before forking the process. 
After the
+    function executes, unfreeze is called to ensure there is no impact on gc 
operations
+    in the original running process.

Review Comment:
   @ashb 
   
   Great catch! You're absolutely right - 
https://github.com/python/cpython/issues/75739 does mention the necessity of 
gc.disable(). For FYI, to summarize briefly: when GC runs before fork, it 
creates empty spaces between pages, and after fork, when objects are written to 
those spaces in the now-shared pages, it triggers COW. This is a very 
interesting theory and I think it's entirely plausible.
   
   Theoretically, it makes sense that applying this would be optimal. And it's 
definitely feasible to implement. We could execute gc.disable() in the initial 
init, then enable it after library loading and worker forking. While this would 
need to be applied across all airflow's components, it shouldn't be problematic 
since gc.enable() would be called before the actual job execution of each 
component.
   
   However, I have some concerns about applying this to our specific case:
   - In our case, disable may not be necessary. The objects loaded between 
disable (init) and fork are not the kind that would be GC'd before fork. Even 
if there are some, they wouldn't be many, and even if they cause some COW, it 
wouldn't be a significant problem.
   - Unlike gc.freeze, gc.disable has much larger side effects. If the 
gc.enable code is not explicitly managed and is accidentally deleted in the 
future, it would cause enormous problems.
   - For the initial batch loading, we could definitely gain the benefits of 
this approach (since disable guarantees no memory holes). However, it's 
impossible to prevent this issue during the process of workers dying and 
restarting.
   
   I think gc.freeze() alone provides the core benefit with lower risk.  But 
I'm open to implementing this if you think the potential gains warrant the 
experiment. Thoughts?
   
   @potiuk WDYT?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to