jhtimmins edited a comment on issue #8674:
URL: https://github.com/apache/airflow/issues/8674#issuecomment-628275845


   Should Airflow support process spawning?
   
   TL;DR - Spawning creates the possibility of hard-to-detect bugs in child 
processes. It’s the default on macOS, the sole option on Windows, and Linux has 
`spawn` available but defaults to `fork`. 
   
   Overview
   * Prior to Python 3.8, multiprocessing.Process used `fork` by default on 
Linux and macOS.
   * As of 3.8, `spawn` is the default method on macOS. “The fork start method 
should be considered unsafe as it can lead to crashes of the subprocess”. 
   * Windows continues to only support `spawn`.
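   To make the platform differences concrete, here is a minimal sketch of inspecting the default start method and requesting a specific one (the printed value is simply whatever the current platform defaults to):

   ```python
   import multiprocessing as mp

   # The default start method is platform-dependent: "spawn" on Windows
   # and, since Python 3.8, on macOS; Linux still defaults to "fork".
   print(mp.get_start_method())

   # A specific method can be requested for one part of a program,
   # without mutating the process-wide default, via a context object:
   spawn_ctx = mp.get_context("spawn")
   print(spawn_ctx.get_start_method())
   ```

   Using `get_context()` rather than `set_start_method()` keeps the choice local, which matters for library code that shouldn’t change global behavior for its callers.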
   
   Differences - Fork vs Spawn
   * `spawn` starts a fresh Python interpreter process for the child. Only a 
small subset of the parent’s state is passed to the child. Safer than `fork`, 
but slower.
   * `fork` uses `os.fork()` to duplicate the parent process. All of the 
parent’s memory objects are available to the child.
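   The difference in what the child can see can be demonstrated directly. The sketch below (all names are illustrative) runs the same demo script under both methods: the `fork` child inherits a module-level global that the parent mutated, while the `spawn` child re-imports the module and sees only the original value. This assumes a Unix-like system, since `fork` is unavailable on Windows.

   ```python
   import os
   import subprocess
   import sys
   import tempfile
   import textwrap

   # Illustrative demo script: the parent mutates a module-level global
   # before starting the child, and the child reports the value it sees.
   demo = textwrap.dedent("""
       import multiprocessing as mp
       import sys

       GLOBAL = "initial"

       def report(queue):
           queue.put(GLOBAL)

       if __name__ == "__main__":
           ctx = mp.get_context(sys.argv[1])  # "fork" or "spawn"
           GLOBAL = "changed-in-parent"
           queue = ctx.Queue()
           child = ctx.Process(target=report, args=(queue,))
           child.start()
           print(queue.get())
           child.join()
   """)

   with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
       f.write(demo)
       script = f.name

   results = {}
   for method in ("fork", "spawn"):
       proc = subprocess.run([sys.executable, script, method],
                             capture_output=True, text=True, check=True)
       results[method] = proc.stdout.strip()
   os.remove(script)

   print(results)  # the fork child sees the mutation; the spawn child does not
   ```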
   
   Issue in 3.8:
   * This is actually a macOS issue rather than a Python one.
   * Calling an Objective-C method between fork() and exec() may cause the 
child process to fail because of a memory lock issue.
   * As of macOS 10.13, Apple added a check for these cases, automatically 
crashing threads where this behavior occurs.
   * Python 3.7 disables this check. 3.8 gets around the issue by spawning new 
processes instead of forking.
   
   Problems with spawn:
   Spawn doesn’t have access to the parent’s state, which can cause unexpected 
behavior changes. For example, I noticed that a test passed under both `spawn` 
and `fork`, but spawning took substantially longer to finish. This was caused 
by a config value set in the parent process, which the child couldn’t access. 
When the child looked up the config value, the default value was returned. This 
caused ~13 dags to be tested, instead of the single dag that the test required.
   
   This class of bug can be mitigated by using environment variables to set 
config values globally, rather than local to the parent process.
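   A sketch of that mitigation (the variable name here is made up, not a real Airflow setting): an OS-level environment variable is copied into the child’s environment under both `fork` and `spawn`, unlike a Python module global, so the lookup never silently falls back to the default.

   ```python
   import os

   # Hypothetical config value, set globally in the process environment
   # rather than in a Python object local to the parent process:
   os.environ["DEMO_DAG_FILTER"] = "single_dag"

   def dag_filter():
       # A child started with either "fork" or "spawn" inherits
       # os.environ, so the default branch is not silently taken.
       return os.environ.get("DEMO_DAG_FILTER", "all_dags")

   print(dag_filter())
   ```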
   
   Using environment variables to pass state presents two issues:
   1. Adding to global state can be dangerous, as it is hard to tell what other 
code is reading/writing that state. The risk of conflict is high.
   2. Airflow has a substantial amount of code already using configs to share 
state between processes. It isn’t obvious where this is/isn’t an issue, 
especially when silent failure is a possibility.
   
   Because of these issues, we should thoughtfully consider whether to use the 
system defaults, or to hardcode `fork` on macOS and Linux (h/t to Daniel 
for suggesting this discussion). 
   
   Considerations:
   * While Python 3.8 can use `fork` instead of `spawn`, doing so reintroduces 
the original possibility of child process failure due to memory lock conflicts.
   * There is some discussion among Python maintainers about whether to 
standardize on `spawn` across all systems, or whether each system should keep 
its own default. They chose not to take this approach for 3.8, but it’s worth 
noting that it’s an option.
   * Using `fork` on macOS is essentially a continuation of the status quo on 
all three platforms. 
       * macOS will work the same as it did under 3.7. Tests will pass. 
       * Given that Windows has only ever supported `spawn`, I assume that 
neither the scheduler nor the tests have ever worked there. I have not tested 
this personally. 
       * Linux will not change.
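   For reference, opting back into `fork` on Python 3.8+ is a one-line change, with the caveats noted above: it reintroduces the macOS fork-safety risk, and it raises `ValueError` on Windows, where `fork` doesn’t exist.

   ```python
   import multiprocessing as mp

   # Restore the pre-3.8 behavior. set_start_method() may only be called
   # once per program unless force=True is passed, and it should run
   # early, before any pools or child processes are created.
   mp.set_start_method("fork", force=True)
   print(mp.get_start_method())
   ```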
   
   Recommendations:
   The choice here depends on whether stability or multi-system support is the 
higher priority for the Airflow project.
   
   If stability > multi-system support - If stability is key, then I recommend 
we hardcode `fork` on all systems. This keeps Airflow incompatible with 
Windows, and reintroduces the possibility of memory bugs on macOS. The benefit 
is that we don’t need to store state globally on macOS, and we don’t need to 
hunt down pieces of code that are currently failing silently.
   
   If multi-system support > stability - We should continue to support the 
default method on every system, and address bugs when they arise. We should 
come up with a standardized process for storing configuration values in either 
config or environment variables. Right now those methods are closely coupled, 
increasing the likelihood of errors.
   
   I lean towards multi-system support, with the caveat that this will likely 
cause hard-to-find bugs. Those bugs will probably be localized to macOS, which 
should lower the risk for Airflow instances running on production Linux 
machines.

