WeichenXu123 opened a new pull request #25138: [SPARK-26175][PYSPARK] Closing stdin of the worker process right after fork
URL: https://github.com/apache/spark/pull/25138
 
 
   ## What changes were proposed in this pull request?
   
   The PySpark worker daemon reads from its stdin the PIDs of workers to kill:
https://github.com/apache/spark/blob/1bb60ab8392adf8b896cc04fb1d060620cf09d8a/python/pyspark/daemon.py#L127
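   
   For context, a minimal sketch of this protocol (an assumed simplification, not the actual `daemon.py` code): the JVM writes the PID of the worker to kill to the daemon's stdin as a 4-byte big-endian integer, and the daemon signals that worker.
   
   ```
   import os
   import signal
   import struct
   
   def watch_stdin_for_kills():
       # Simplified sketch, not the real daemon loop: read 4-byte big-endian
       # worker PIDs from stdin (fd 0, the pipe from the JVM) and SIGKILL them.
       while True:
           data = os.read(0, 4)
           if len(data) < 4:
               break                  # EOF: the JVM closed the pipe, shut down
           (worker_pid,) = struct.unpack(">i", data)
           try:
               os.kill(worker_pid, signal.SIGKILL)
           except OSError:
               pass                   # the worker already exited
   ```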
   
   However, the worker process is forked from the worker daemon process, and stdin was not closed in the child after the fork. As a result, the child (and any user code it runs) can read from stdin as well, which blocks the daemon from receiving the PID to kill. This can cause issues because the task reaper may then detect that the task was not terminated and eventually kill the JVM.
   
   This PR fixes this by closing stdin of the worker process right after the fork.
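   
   A minimal sketch of the pattern this PR applies, assuming a plain `os.fork()`-based launch (`run_worker` is a hypothetical stand-in for the real worker entry point; this is not the actual `daemon.py` diff):
   
   ```
   import os
   
   def launch_worker(run_worker):
       pid = os.fork()
       if pid == 0:
           # Child: drop the stdin inherited from the daemon right after the fork,
           # so neither the worker nor any subprocess spawned by user code can
           # consume the PIDs the JVM writes to the daemon.
           os.close(0)
           # (A more defensive variant would dup /dev/null onto fd 0 so a later
           # open() cannot silently reuse the lowest free descriptor.)
           run_worker()
           os._exit(0)
       # Parent (the daemon): keeps stdin open to receive kill requests.
       return pid
   ```
   
   With stdin closed in the child, a user subprocess such as `cat` fails fast instead of blocking forever, which is what the manual test below exercises.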
   
   ## How was this patch tested?
   
   Manually tested.
   
   In `pyspark`, run:
   ```
   import subprocess
   
   def task(_):
       # "cat" reads from stdin; before the fix it inherits the daemon's stdin and blocks forever.
       subprocess.check_output(["cat"])
       return []
   
   sc.parallelize(range(1), 1).mapPartitions(task).count()
   ```
   
   Before:
   The job gets stuck; pressing Ctrl+C exits the job, but the Python worker process does not exit.
   After:
   The job fails immediately with an error like `subprocess.CalledProcessError: Command '['cat']' returned non-zero exit status 1.`, and the Python worker exits immediately.
   
   Please review https://spark.apache.org/contributing.html before opening a pull request.
   
