xinbinhuang opened a new pull request #15743: URL: https://github.com/apache/airflow/pull/15743
This is an attempt to close: #14813

TODO:
- [ ] Add tests

The root cause of the original problem is that a subprocess inherits the *FDs for stdout/stderr* but not the *`sys.stdout`/`sys.stderr` file objects*. When the `StandardTaskRunner` calls `.start()` (which runs `airflow tasks run --raw ...` under the hood), it "redirects" the `sys.stdout` file object to the custom [StreamLogWriter](https://github.com/apache/airflow/blob/d627dfa72b4395740f9a4948f549bf53a89b2285/airflow/cli/commands/task_command.py#L236), which writes everything into the task logger. However, this redirection only happens at the Python level: when the process starts a new subprocess, the child inherits the original FDs, so its output bypasses the redirection entirely.

Here is a simple script that shows this behavior:

```python
from contextlib import redirect_stdout
import io
import subprocess

s_io = io.StringIO()
with redirect_stdout(s_io):
    print('this is captured')
    subprocess.run(['echo', 'this is not'])

print('Captured by redirect: ', s_io.getvalue())
# this is not
# Captured by redirect:  this is captured
```

Note: why do the *`sys.stdout`/`sys.stderr` file objects* exist at all? They **buffer** output before flushing it to the system in one go, because system calls are expensive.

Before the change in #6627, the `--raw` command was started as a subprocess with its stdout/stderr replaced by a PIPE at the FD level, so the stdout/stderr of all subsequent subprocesses (i.e. any subprocess spawned by the Python callable in a `PythonOperator`) went into the PIPE. A separate thread was also spawned to read from this stream and write it into the task logs using `self.log.info`: https://github.com/apache/airflow/blob/25caeda58b50eae6ef425a52e794504bc63855d1/airflow/task/task_runner/base_task_runner.py#L137-L153

However, in the fork path the FDs are left unchanged, so all subsequent subprocesses still write to the original FDs (i.e. print to the console). This PR essentially creates new PIPEs that replace the original FDs, so we can read the logs from the PIPE similarly to the `subprocess.Popen` path: https://github.com/apache/airflow/blob/25caeda58b50eae6ef425a52e794504bc63855d1/airflow/task/task_runner/standard_task_runner.py#L49

Potentially, a better approach is to rewrite the `StreamLogWriter` to replace the FDs under the hood, but currently I'm still wrapping my head around the whole FD business. Unless someone can point me in a direction on how to rewrite the `StreamLogWriter` or suggest an alternative approach, I will leave that to another day and another PR...
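To make the buffering note above concrete, here is a tiny demo (mine, not from the PR) of the two layers: `print` goes through the buffered `sys.stdout` file object, while `os.write` hits FD 1 directly:

```python
import os
import sys

# print() goes through the sys.stdout file object, which buffers the
# text before issuing a write() syscall on FD 1 (no newline here, so
# no line-buffered flush happens yet).
print('buffered by sys.stdout', end='')

# os.write() bypasses the Python-level buffer and writes straight to
# FD 1, so this line appears on the console *first*.
os.write(1, b'written straight to FD 1\n')

# Flushing finally pushes the buffered text down to FD 1.
sys.stdout.flush()
```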
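For reference, the pre-#6627 `subprocess.Popen` pattern described above looks roughly like this (a minimal sketch with a generic logger, not Airflow's actual code):

```python
import logging
import subprocess
import threading

logging.basicConfig(level=logging.INFO)
log = logging.getLogger('task')

# Replacing stdout/stderr with a PIPE at the FD level means anything the
# child -- or any grandchild it spawns -- writes to FD 1/2 lands in the pipe.
proc = subprocess.Popen(
    ['sh', '-c', 'echo from child; (echo from grandchild)'],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
)

def read_task_logs(stream):
    # Forward every line from the pipe into the task logger.
    for line in iter(stream.readline, ''):
        log.info(line.rstrip())
    stream.close()

reader = threading.Thread(target=read_task_logs, args=(proc.stdout,))
reader.start()
proc.wait()
reader.join()
```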
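And the FD-level replacement this PR does in the fork path can be sketched like this (again my own minimal illustration, assuming we just want to capture the output and dump it at the end):

```python
import os
import subprocess
import sys

# Create a pipe and splice its write end over FDs 1 and 2, so even
# subprocesses that only inherit the raw FDs write into the pipe.
read_fd, write_fd = os.pipe()
saved_out, saved_err = os.dup(1), os.dup(2)  # remember the real stdout/stderr

sys.stdout.flush()    # don't let already-buffered text leak into the pipe
sys.stderr.flush()
os.dup2(write_fd, 1)  # FD 1 now points at the pipe's write end
os.dup2(write_fd, 2)  # FD 2 as well
os.close(write_fd)    # FDs 1 and 2 keep the write end alive

print('captured from python', flush=True)
subprocess.run(['echo', 'captured from a subprocess too'])

os.dup2(saved_out, 1)  # restore the original FDs; the pipe's write end
os.dup2(saved_err, 2)  # is now fully closed, so the reader will see EOF
os.close(saved_out)
os.close(saved_err)

chunks = []
while True:           # drain the pipe until EOF
    chunk = os.read(read_fd, 4096)
    if not chunk:
        break
    chunks.append(chunk)
os.close(read_fd)

print('pipe captured:', b''.join(chunks).decode(), sep='\n')
```

Note that this sketch only works because the output is tiny; in real code the read end has to be drained concurrently (as the runner's log-reading loop does) so a chatty task can't fill the pipe buffer and block.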
