BryanCutler commented on issue #24834: [WIP][SPARK-27992][PYTHON] Synchronize with Python connection thread to propagate errors
URL: https://github.com/apache/spark/pull/24834#issuecomment-500622703

From the discussion in #24677, regarding the `DataFrame.collect()` with `collectToPython()` code path: it doesn't have quite the same issue.

`collect()` and the standard `toPandas()` first make a Py4J call; then, in Scala, they run the Spark job and gather all the data in the main thread. Once the data is local, the serializer is started in a background thread, which completes the Py4J call. If the Spark job raises an error, the Py4J call is interrupted and the error is propagated. If there is somehow an error during the serializer part, I don't think it will be propagated to Python.

`toLocalIterator()` and `toPandas()` with Arrow enabled work differently: they first make a Py4J call that sets up the socket connection and starts the background thread. That call returns immediately, and the Spark job(s) then run in the background thread. The current approach catches a SparkException in the background thread and sends it through the serializer, whereas this PR returns the serving thread object from the first Py4J call, so that an additional Py4J call can synchronize and evaluate the thread's future, raising the exception if one occurred.
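
To illustrate the synchronization idea described above, here is a minimal sketch in plain Python. It does not use Spark or Py4J; `ServingThreadHandle`, `start_serving`, `failing_job`, and `run_demo` are hypothetical names invented for this example, and `concurrent.futures` stands in for the serving thread and its future. The first call returns a handle immediately, and a second call blocks on the future and re-raises whatever the background work raised.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class ServingThreadHandle:
    """Hypothetical stand-in for the serving-thread object returned by the first call."""

    def __init__(self, future: Future):
        self._future = future

    def synchronize(self):
        # Blocks until the background work finishes. Future.result() re-raises
        # any exception raised inside the background thread.
        return self._future.result()


def start_serving(job, executor: ThreadPoolExecutor) -> ServingThreadHandle:
    """First call: submit the job to a background thread and return right away."""
    return ServingThreadHandle(executor.submit(job))


def failing_job():
    # Assumed failure, standing in for an error raised while the job runs.
    raise RuntimeError("job failed in background thread")


def run_demo(job=failing_job):
    """Second call: synchronize with the background thread and surface its error."""
    with ThreadPoolExecutor(max_workers=1) as executor:
        handle = start_serving(job, executor)
        try:
            return ("ok", handle.synchronize())
        except RuntimeError as exc:
            return ("error", str(exc))
```

Without the explicit `synchronize()` step, the exception would stay stored in the future and the caller would never see it, which is roughly the situation the extra synchronizing call is meant to avoid.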
