HyukjinKwon opened a new pull request #23564: [SPARK-25992][PYTHON] Document 
SparkContext cannot be shared for multiprocessing
URL: https://github.com/apache/spark/pull/23564
 
 
   ## What changes were proposed in this pull request?
   
   This PR proposes to explicitly document that SparkContext cannot be shared across multiple processes, and that multi-process execution is not guaranteed to work in PySpark.
   
   I have seen users attempt to use multiple processes via the `multiprocessing` module from time to time. For instance, see the example in the JIRA (https://issues.apache.org/jira/browse/SPARK-25992).
   
   Py4J itself does not support Python's multiprocessing out of the box 
(sharing the same JavaGateways for instance).
   
   In general, this pattern can cause errors with arbitrary symptoms that are difficult to diagnose. For instance, see the error message in the JIRA:
   
   ```
   Traceback (most recent call last):
   File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 317, in _handle_request_noblock
       self.process_request(request, client_address)
   File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 348, in process_request
       self.finish_request(request, client_address)
   File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 361, in finish_request
       self.RequestHandlerClass(request, client_address, self)
   File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 696, in __init__
       self.handle()
   File "/usr/local/hadoop/spark2.3.1/python/pyspark/accumulators.py", line 238, in handle
       _accumulatorRegistry[aid] += update
   KeyError: 0
   ``` 
   
   The root cause is that the global `_accumulatorRegistry` is not shared across processes.
   
   Using threads instead of processes is quite easy in Python. Compare `threading` and `multiprocessing`: they can usually serve as direct replacements for each other. For instance, Python also provides a thread pool (`multiprocessing.pool.ThreadPool`), which can directly replace the process-based pool (`multiprocessing.Pool`).
   
   ## How was this patch tested?
   
   Manually tested, and manually built the doc.
   

