wjddn279 commented on code in PR #58365: URL: https://github.com/apache/airflow/pull/58365#discussion_r2545090946
########## airflow-core/src/airflow/utils/gc_utils.py: ########## @@ -0,0 +1,44 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +from __future__ import annotations + +import gc +from functools import wraps + + +def with_gc_freeze(func): + """ + Freeze the GC before executing the function and unfreeze it after execution. + + This is done to prevent memory increase due to COW (Copy-on-Write) by moving all + existing objects to the permanent generation before forking the process. After the + function executes, unfreeze is called to ensure there is no impact on gc operations + in the original running process. Review Comment: @ashb Great catch! You're absolutely right - https://github.com/python/cpython/issues/75739 does mention the necessity of gc.disable(). For FYI, to summarize briefly: when GC runs before fork, it creates empty spaces between pages, and after fork, when objects are written to those spaces in the now-shared pages, it triggers COW. This is a very interesting theory and I think it's entirely plausible. Theoretically, it makes sense that applying this would be optimal. And it's definitely feasible to implement. We could execute gc.disable() in the initial init, then enable it after library loading and worker forking. While this would need to be applied across all airflow's components, it shouldn't be problematic since gc.enable() would be called before the actual job execution of each component. However, I have some concerns about applying this to our specific case: - In our case, disable may not be necessary. The objects loaded between disable (init) and fork are not the kind that would be GC'd before fork. Even if there are some, they wouldn't be many, and even if they cause some COW, it wouldn't be a significant problem. - Unlike gc.freeze, gc.disable has much larger side effects. If the gc.enable code is not explicitly managed and is accidentally deleted in the future, it would cause enormous problems. - For the initial batch loading, we could definitely gain the benefits of this approach (since disable guarantees no memory holes). However, it's impossible to prevent this issue during the process of workers dying and restarting. I think gc.freeze() alone provides the core benefit with lower risk. But I'm open to implementing this if you think the potential gains warrant the experiment. Thoughts? @potiuk WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
