potiuk commented on PR #39484: URL: https://github.com/apache/airflow/pull/39484#issuecomment-2100687710
This is a very risky change if you have not thoroughly tested various configuration and I think it's for the worse on production systems. When you are using it on MacOS, yes it will be a bit slow because default multiprocessing context on MacOS is `spawn` https://docs.python.org/3/library/multiprocessing.html#multiprocessing-start-methods but on linux it is `fork` and this implementation is optimized for production case where Airflow is run on Linux. Have you considered impact on production for this change? The overhead for production (Linux and system where fork is supported) is minimal, because `fork` model is used for the processes. Forking a process that does very little is very inexpensive in this model, because the heap memory is in "copy on write" mode, which means that forking is nearly instanteneous and until some memory is modiified, no extra memory is used. On the other hand, the change you propose will impact memory used in a long term as the long running processes might accumulate memory when run for a long time and run different callables. So it might cause memory leak, not mentioning that in the original model the forked processes are only needed for a very short time - to send tasks to celery and for example they are not consuming any resources when there are no tasks to run (because they are stopped right after they complete, Have you tested it beyond MacOS to see what kind of performance gains you have and what impact on memory usage and resource usage at scale? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
