xinbinhuang opened a new pull request #15743: URL: https://github.com/apache/airflow/pull/15743
This is an attempt to close: #14813

TODO:
- [ ] Add tests

The root cause of the original problem is that a subprocess inherits the *FDs for stdout/stderr* but not the *`sys.stdout`/`sys.stderr` file objects*. When the `StandardTaskRunner` calls `.start()` (which runs `airflow tasks run --raw ...` under the hood), it "redirects" the `sys.stdout` file object to the custom [StreamLogWriter](https://github.com/apache/airflow/blob/d627dfa72b4395740f9a4948f549bf53a89b2285/airflow/cli/commands/task_command.py#L236), which writes everything into the task logger. However, this redirection only happens at the Python level: when the process starts a new subprocess, the child inherits the original FDs, so its output bypasses the redirection entirely.

Here is a simple script that shows this behavior:

```python
from contextlib import redirect_stdout
import io
import subprocess

s_io = io.StringIO()
with redirect_stdout(s_io):
    print('this is captured')
    subprocess.run(['echo', 'this is not'])

print('Captured by redirect: ', s_io.getvalue())
# this is not
# Captured by redirect:  this is captured
```

Note: why do the *`sys.stdout`/`sys.stderr` file objects* exist at all? They **buffer** output before flushing it to the system in one go, because system calls are expensive.

Before the change in #6627, the `--raw` command was started as a subprocess with its stdout/stderr replaced by a PIPE at the FD level, so the stdout/stderr of all subsequent subprocesses (i.e. any subprocess spawned by the Python callable in a `PythonOperator`) went into the PIPE. A separate thread was also spawned to read from this stream and write it into the task logs using `self.log.info`: https://github.com/apache/airflow/blob/25caeda58b50eae6ef425a52e794504bc63855d1/airflow/task/task_runner/base_task_runner.py#L137-L153

However, in the fork path the FDs are left unchanged, so all subsequent subprocesses still write to the original FDs (i.e. print to the console). This PR essentially creates new PIPEs that replace the original FDs, so we can read the logs from the PIPE similarly to the `subprocess.Popen` path: https://github.com/apache/airflow/blob/25caeda58b50eae6ef425a52e794504bc63855d1/airflow/task/task_runner/standard_task_runner.py#L49

Potentially, a better approach is to rewrite the `StreamLogWriter` to replace the FDs under the hood, but currently I'm still wrapping my head around the whole FD business. Unless someone can point me in a direction on how to rewrite the `StreamLogWriter` or suggest an alternative approach, I will leave that to another day and another PR...
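To make the buffering note above concrete, here is a tiny demo (mine, not from the PR) of the two layers: `print` goes through the buffered `sys.stdout` file object, while `os.write` hits FD 1 directly:

```python
import os
import sys

# print() goes through the sys.stdout file object, which buffers the
# text before issuing a write() syscall on FD 1 (no newline here, so
# no line-buffered flush happens yet).
print('buffered by sys.stdout', end='')

# os.write() bypasses the Python-level buffer and writes straight to
# FD 1, so this line appears on the console *first*.
os.write(1, b'written straight to FD 1\n')

# Flushing finally pushes the buffered text down to FD 1.
sys.stdout.flush()
```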
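For reference, the pre-#6627 `subprocess.Popen` pattern described above looks roughly like this (a minimal sketch with a generic logger, not Airflow's actual code):

```python
import logging
import subprocess
import threading

logging.basicConfig(level=logging.INFO)
log = logging.getLogger('task')

# Replacing stdout/stderr with a PIPE at the FD level means anything the
# child -- or any grandchild it spawns -- writes to FD 1/2 lands in the pipe.
proc = subprocess.Popen(
    ['sh', '-c', 'echo from child; (echo from grandchild)'],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
)

def read_task_logs(stream):
    # Forward every line from the pipe into the task logger.
    for line in iter(stream.readline, ''):
        log.info(line.rstrip())
    stream.close()

reader = threading.Thread(target=read_task_logs, args=(proc.stdout,))
reader.start()
proc.wait()
reader.join()
```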
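And the FD-level replacement this PR does in the fork path can be sketched like this (again my own minimal illustration, assuming we just want to capture the output and dump it at the end):

```python
import os
import subprocess
import sys

# Create a pipe and splice its write end over FDs 1 and 2, so even
# subprocesses that only inherit the raw FDs write into the pipe.
read_fd, write_fd = os.pipe()
saved_out, saved_err = os.dup(1), os.dup(2)  # remember the real stdout/stderr

sys.stdout.flush()    # don't let already-buffered text leak into the pipe
sys.stderr.flush()
os.dup2(write_fd, 1)  # FD 1 now points at the pipe's write end
os.dup2(write_fd, 2)  # FD 2 as well
os.close(write_fd)    # FDs 1 and 2 keep the write end alive

print('captured from python', flush=True)
subprocess.run(['echo', 'captured from a subprocess too'])

os.dup2(saved_out, 1)  # restore the original FDs; the pipe's write end
os.dup2(saved_err, 2)  # is now fully closed, so the reader will see EOF
os.close(saved_out)
os.close(saved_err)

chunks = []
while True:           # drain the pipe until EOF
    chunk = os.read(read_fd, 4096)
    if not chunk:
        break
    chunks.append(chunk)
os.close(read_fd)

print('pipe captured:', b''.join(chunks).decode(), sep='\n')
```

Note that this sketch only works because the output is tiny; in real code the read end has to be drained concurrently (as the runner's log-reading loop does) so a chatty task can't fill the pipe buffer and block.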
