BryanCutler commented on issue #24834: [WIP][SPARK-27992][PYTHON] Synchronize 
with Python connection thread to propagate errors
URL: https://github.com/apache/spark/pull/24834#issuecomment-500622703
 
 
   Following up on the discussion in #24677 regarding the `DataFrame.collect()` 
path through `collectToPython()`: it doesn't have quite the same issue. 
`collect()` and the standard `toPandas()` first make a Py4j call; in Scala, 
that call runs the Spark job and gathers all of the data in the main thread. 
Once the data is local, the serializer is started in a background thread and 
the Py4j call completes. If the Spark job raises an error, the Py4j call is 
interrupted and the error is propagated to Python. If an error somehow occurs 
during serialization, though, I don't think it will be propagated to Python.
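
   To illustrate the `collect()` flow described above, here is a minimal Python sketch. It is not Spark's code: `collect_then_serve`, `run_job`, and `serialize` are hypothetical names, and plain threads stand in for the Py4j call and the serializer thread.

```python
import threading


def collect_then_serve(run_job, serialize):
    """Run the job on the calling thread, then serialize in a background thread.

    All names are hypothetical stand-ins, not part of any Spark API.
    """
    # The job runs synchronously, so an exception here reaches the caller.
    rows = run_job()

    # Failures in the background thread are only recorded here; the caller
    # never sees them unless it inspects this list itself.
    errors = []

    def serve():
        try:
            serialize(rows)
        except Exception as exc:
            errors.append(exc)

    thread = threading.Thread(target=serve, daemon=True)
    thread.start()
    return thread, errors
```

   An exception from `run_job` surfaces in the caller directly, while one from `serialize` stays in the background thread, which is the gap mentioned for the serializer step.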
   
   `toLocalIterator()` and the Arrow-enabled `toPandas()` work differently: 
they first make a Py4j call that sets up the socket connection and starts the 
background thread. That Py4j call returns immediately, and the Spark job(s) 
then run in the background thread. The current approach catches a 
SparkException in the background thread and sends it through the serializer, 
whereas this PR returns the serving thread object from the first Py4j call, so 
that an additional Py4j call can synchronize with the thread and evaluate its 
future, raising the exception if one occurred.
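
   The two-call pattern proposed here can be sketched in plain Python with `concurrent.futures`. This is only an analogy under stated assumptions: `start_serving` and `sync_with_server` are hypothetical names, and the executor stands in for the JVM serving thread.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the JVM-side serving thread.
_executor = ThreadPoolExecutor(max_workers=1)


def start_serving(run_job):
    """First call: start the job in the background and return its future at once."""
    return _executor.submit(run_job)


def sync_with_server(future):
    """Second call: block until the background work ends, re-raising any error."""
    return future.result()
```

   The first call never raises, even if the job fails; the failure only surfaces when the caller makes the second, synchronizing call.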
