jhtimmins edited a comment on issue #8674:
URL: https://github.com/apache/airflow/issues/8674#issuecomment-628275845
Should Airflow support process spawning?
Tl;Dr - Spawning creates the possibility of hard-to-detect bugs in child
processes. It’s the default on MacOS, the sole option on Windows, and Linux has
`spawn` but defaults to `fork`.
Overview
* Prior to Python 3.8, multiprocessing.Process used `fork` by default on
Linux and macOS.
* As of 3.8, `spawn` is the default method on MacOS. “The fork start method
should be considered unsafe as it can lead to crashes of the subprocess”.
* Windows continues to only support `spawn`.
Differences - Fork vs Spawn
* Spawn starts a new Python interpreter process for the child. Only a small
subset of data from the parent’s scope is passed to the child. Safer than Fork,
but slower.
* Forking uses os.fork() to fork the parent process. All parent memory
objects are available to the child.
Issue in 3.8:
* Actually a Mac issue
* Calling an Objective-C method between fork() and exec() may cause the
child process to fail bc of a memory lock issue.
* As of macOS 10.13, Apple has added a check for these cases, automatically
crashing threads where this behavior occurs.
* Python 3.7 disables this check. 3.8 gets around the issue by spawning new
processes instead forking.
Problems with spawn:
Spawn doesn’t have access to the parent’s state. This can cause unexpected
behavior changes. For example, I noticed that a test passed on both `spawn` and
`fork`, but spawning took substantially longer to finish. The was caused by a
config value set in the parent process, which the child couldn’t access. When
the child looked up the config value, the default value was returned. This
caused ~13 dags to be tested, instead of the single dag that the test required.
This class of bug can be mitigated by using environment variables to set
config values globally, rather than local to the parent process.
Using environment variables to pass state presents two issues:
1. Adding to global state can be dangerous, as it is hard to tell what other
code is reading/writing that state. The risk of conflict is high.
2. Airflow has a substantial amount of code already using configs to share
state between processes. It isn’t obvious where this is/isn’t an issue,
especially when silent failure is a possibility.
Because of these issues, we should thoughtfully consider whether to use the
system defaults, or to hardcode to only using on Mac and Linux (h/t to Daniel
for suggesting this discussion).
Considerations:
* While Python 3.8 can use `fork` instead of `spawn`, doing so reintroduces
the original possibility of child process failure due to memory lock conflicts.
* There is some discussion among Python maintainers whether to standardize
with `spawn` across all systems, or whether they should each use their own
default. They chose not to take this approach for 3.8, but it’s worth noting
that it’s an option.
* Using Fork on macOS is essentially a continuation of the status quo on all
three platforms.
* macOS will work the same as it did for 3.7. Tests will pass
* Given that Windows only supported spawning all along, I assume that
neither the scheduler nor the tests have worked (ever?). I have not tested this
personally.
* Linux will not change.
Recommendations:
The choice here depends where stability or multi-system support is a higher
priority to the Airflow project.
If stability > multi-system support - If stability is key, then I recommend
we hardcode forking all systems. This will continue to make Airflow
incompatible with Windows, and introduces the possibility of memory bugs on
Mac. The benefit is that we don’t need to store state globally on macOS and we
don’t need to hunt down pieces of code that are currently failing silently.
If multi-system support > stability - We should continue to support the
default method on every system, and address bugs when they arise. We should
come up with a standardized process for storing configuration values in either
config or environment variables. Right now those methods are closely coupled,
increasing the likelihood of errors.
I lean towards multi-system support, with the caveat that this will likely
cause hard to find bugs. Those bugs will probably be localized to MacOS, which
should lower the risk for Airflow instances running on production Linux
machines.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]