xinbinhuang opened a new pull request #15743:
URL: https://github.com/apache/airflow/pull/15743


   This is an attempt to close: #14813
   
   TODO:
   - [ ] Add tests
   
   The root cause of the original problem is that a subprocess inherits the *FDs for stdout/stderr* but not the *sys.stdout/stderr file objects*. When the `StandardTaskRunner` calls `.start()` (which runs `airflow tasks run --raw ...` under the hood), it "redirects" the `sys.stdout` file object to the custom [StreamLogWriter](https://github.com/apache/airflow/blob/d627dfa72b4395740f9a4948f549bf53a89b2285/airflow/cli/commands/task_command.py#L236), which writes everything into the task logger. However, this redirection exists only at the Python level: when the task starts a new subprocess, the child inherits the original FDs, so its output bypasses the redirection entirely. Here is a simple script that shows this behavior:
   
   ```python
   from contextlib import redirect_stdout
   import io
   import subprocess
   
   s_io = io.StringIO()
   with redirect_stdout(s_io):
       print('this is captured')
       subprocess.run(['echo', 'this is not'])
   
   print('Captured by redirect: ', s_io.getvalue())
   
   # Output (the subprocess writes to the console, not the buffer):
   # this is not
   # Captured by redirect:  this is captured
   ```
   
   Note: why are there *sys.stdout/stderr file objects* at all? They are used to **buffer** output before flushing it to the underlying FD in one go, because system calls are expensive.
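
   For example, here is a minimal demo (plain Python, not Airflow code) of how the Python-level buffer and the underlying FD interact:

   ```python
   import os
   import sys

   # sys.stdout is a buffered text wrapper around file descriptor 1;
   # os.write() bypasses the Python-level buffer and hits the FD directly.
   sys.stdout.write("goes through sys.stdout's buffer\n")
   os.write(sys.stdout.fileno(), b"written straight to FD 1\n")
   sys.stdout.flush()  # a single write() system call flushes the buffer

   # When stdout is not a terminal (e.g. piped to a file), the FD-level
   # line appears first because the buffered line is still in memory.
   ```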
   
   Before the change in #6627, the `--raw` command was started as a subprocess with stdout/stderr replaced by a PIPE at the FD level, so the stdout/stderr of all subsequent subprocesses (e.g. a subprocess spawned by the python callable in a PythonOperator) went into the PIPE. A separate thread was also spawned to read from this stream and write it to the task logs using `self.log.info`:
   
https://github.com/apache/airflow/blob/25caeda58b50eae6ef425a52e794504bc63855d1/airflow/task/task_runner/base_task_runner.py#L137-L153
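
   For illustration, a minimal sketch of that approach (not Airflow's actual code):

   ```python
   import subprocess
   import threading

   # Start the command with stdout/stderr replaced by a PIPE at the FD
   # level; output from nested subprocesses lands in the same pipe.
   proc = subprocess.Popen(
       ["bash", "-c", "echo from the task; bash -c 'echo from a nested subprocess'"],
       stdout=subprocess.PIPE,
       stderr=subprocess.STDOUT,  # merge stderr into the same pipe
       text=True,
   )

   def reader(stream):
       # Runs in a separate thread, mirroring the log-reading thread in
       # base_task_runner; print() stands in for self.log.info(...).
       for line in stream:
           print("task log:", line.rstrip())

   t = threading.Thread(target=reader, args=(proc.stdout,))
   t.start()
   proc.wait()
   t.join()
   ```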
   
   However, in the fork path the FDs are left unchanged, so all subsequent subprocesses still write to the original FDs (i.e. print to the console). This PR essentially creates new PIPEs to replace the original FDs, so we can read the logs from the PIPE just as in the `subprocess.Popen` path:
   
   
https://github.com/apache/airflow/blob/25caeda58b50eae6ef425a52e794504bc63855d1/airflow/task/task_runner/standard_task_runner.py#L49
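
   A rough sketch of the idea (my reading of it, not the PR's exact code):

   ```python
   import os

   # After os.fork(), point the child's FD 1 at the write end of a fresh
   # pipe, so processes spawned *by the child* inherit the pipe instead
   # of the original console FD.
   read_fd, write_fd = os.pipe()
   pid = os.fork()
   if pid == 0:
       # Child: replace the stdout FD with the pipe, then do the work.
       os.close(read_fd)
       os.dup2(write_fd, 1)
       os.close(write_fd)
       os.system("echo written by a grandchild subprocess")  # captured too
       os._exit(0)

   # Parent: read everything the child (and its children) wrote to FD 1.
   os.close(write_fd)
   with os.fdopen(read_fd) as pipe:
       for line in pipe:
           print("captured:", line.rstrip())
   os.waitpid(pid, 0)
   ```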
   
   Potentially, a better approach is to rewrite the `StreamLogWriter` to replace the FDs under the hood. But I'm still wrapping my head around the whole FD story, so unless someone can point me in a direction on how to rewrite the `StreamLogWriter`, or suggest an alternative approach, I will leave that to another day and another PR...
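
   For reference, a hypothetical sketch of what FD-level redirection inside the same process could look like (the structure below is my assumption, not an existing Airflow API):

   ```python
   import logging
   import os
   import sys
   import threading

   logging.basicConfig(level=logging.INFO)
   log = logging.getLogger("airflow.task")  # hypothetical target logger

   # dup2 a pipe over FD 1 so subprocesses (and C extensions) are
   # captured, and drain the pipe into the logger from a thread.
   read_fd, write_fd = os.pipe()
   saved_fd = os.dup(1)   # remember the real stdout so we can restore it
   os.dup2(write_fd, 1)
   os.close(write_fd)

   def drain():
       with os.fdopen(read_fd) as pipe:
           for line in pipe:
               log.info(line.rstrip())

   t = threading.Thread(target=drain)
   t.start()

   print("captured at the FD level")
   os.system("echo subprocess output captured too")

   sys.stdout.flush()     # push any Python-level buffer into the pipe
   os.dup2(saved_fd, 1)   # restore stdout; closes the pipe's last writer
   os.close(saved_fd)
   t.join()
   ```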
   
   

