potiuk commented on code in PR #58365:
URL: https://github.com/apache/airflow/pull/58365#discussion_r2539692368


##########
airflow-core/src/airflow/utils/gc_utils.py:
##########
@@ -0,0 +1,44 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+from __future__ import annotations
+
+import gc
+from functools import wraps
+
+
+def with_gc_freeze(func):
+    """
+    Freeze the GC before executing the function and unfreeze it after 
execution.
+
+    This is done to prevent memory increase due to COW (Copy-on-Write) by 
moving all
+    existing objects to the permanent generation before forking the process. 
After the
+    function executes, unfreeze is called to ensure there is no impact on gc 
operations
+    in the original running process.

Review Comment:
   Yeah.
   
   I spent quite a bit of time understanding how gc.freeze works and it looks very sound (and it very nicely addresses the problems I earlier hypothesised about after listening to the Py.Core podcast episode on fork, gc and refcounting).
   
   The thing is that we **absolutely** want the frozen objects not to be collected in the forks - because, due to reference counting, it is the collection itself that causes COW: it starts copying the memory blocks where the objects happen to live out of shared memory into per-subprocess copies of those blocks.
   
   Basically, any object that is garbage-collectable and not frozen at the moment of forking is practically GUARANTEED to have its memory pages multiplied n times (n = number of forked processes). And since we are importing a lot of dependent packages, we have absolutely no idea how much of the memory currently in use is garbage-collectable.
   
   It's very nicely explained and discussed here: https://github.com/python/cpython/issues/75739 - the issue where gc.freeze was added.
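   
   For reference, the pattern boils down to something like the sketch below. The decorator body is not visible in this hunk, so this is only an assumption about the shape of the implementation (and `spawn_workers` is a made-up example), not the actual PR code:
   
   ```python
   from __future__ import annotations
   
   import gc
   import os
   from functools import wraps
   
   
   def with_gc_freeze(func):
       """Freeze the GC around ``func`` so existing objects stay in COW-shared pages."""
   
       @wraps(func)
       def wrapper(*args, **kwargs):
           # Move all currently tracked objects into the permanent generation so the
           # collector never visits them (and never dirties their pages) while we fork.
           gc.freeze()
           try:
               return func(*args, **kwargs)
           finally:
               # Restore normal collection behaviour in the parent process afterwards.
               gc.unfreeze()
   
       return wrapper
   
   
   @with_gc_freeze
   def spawn_workers(n: int) -> list[int]:
       """Hypothetical example: fork ``n`` children while the parent's heap is frozen."""
       pids = []
       for _ in range(n):
           pid = os.fork()
           if pid == 0:
               # Child process: re-enable gc here if the parent had disabled collection,
               # then do the work and exit.
               os._exit(0)
           pids.append(pid)
       return pids
   ```
   
   The CPython docs for gc.freeze additionally suggest disabling collection in the parent before freezing and re-enabling it in the children, since a collection right before fork can itself free pages and make later allocations copy-on-write unfriendly.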



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
