dvogelbacher commented on issue #24677: [SPARK-27805][PYTHON] Propagate SparkExceptions during toPandas with arrow enabled
URL: https://github.com/apache/spark/pull/24677#issuecomment-494884105

`collectAsArrowToPython` just returns the socket info from `PythonRDD.serveToStream("serve-Arrow")`. The exception occurs during the `runJob` call inside `serveToStream`, and that call is executed in a background thread. When the background thread encounters an exception, it closes the `OutputStream`. The `ArrowStreamSerializer` in the Python process then thinks it has read all the batches, after which the `ArrowCollectSerializer` tries to read the batch order indices and throws an `EOFError`, because those were never written.

Also note that before https://github.com/apache/spark/pull/22275 (which introduced the batch order indices) this would not have resulted in any error on the Python side. We would have just dropped some partitions without throwing an error. Now at least we get an error, but it is not a very helpful one.
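
The failure mode described above can be reproduced with a small, self-contained sketch. This is not Spark's or Arrow's actual wire format: the length-prefixed framing, the `0` end marker, and the names `serve`, `read_batches`, `collect`, and `fail_after` are all assumptions made for illustration. It only shows why a reader that treats a bare end-of-stream as "no more batches" ends up with an unhelpful `EOFError` once it expects trailing indices, and why it would silently drop data if it did not expect them.

```python
import io
import struct


def write_int(value, stream):
    stream.write(struct.pack("!i", value))


def read_int(stream):
    # Raises EOFError when the stream ends before a full int is available.
    data = stream.read(4)
    if len(data) < 4:
        raise EOFError("stream ended before an int could be read")
    return struct.unpack("!i", data)[0]


def serve(batches, fail_after=None):
    """Hypothetical writer: length-prefixed batches, a 0 end marker,
    then the batch order indices. If fail_after is set, simulate the
    job failing: stop writing and just close the stream."""
    out = io.BytesIO()
    for i, batch in enumerate(batches):
        if fail_after is not None and i >= fail_after:
            return out.getvalue()  # closed early, nothing else written
        write_int(len(batch), out)
        out.write(batch)
    write_int(0, out)  # end of batches
    write_int(len(batches), out)  # number of indices
    for idx in range(len(batches)):
        write_int(idx, out)
    return out.getvalue()


def read_batches(stream):
    """Treats both the end marker and a bare EOF as 'no more batches'."""
    batches = []
    while True:
        header = stream.read(4)
        if len(header) < 4:
            break  # EOF is indistinguishable from a clean end here
        (length,) = struct.unpack("!i", header)
        if length == 0:
            break
        batches.append(stream.read(length))
    return batches


def collect(payload):
    stream = io.BytesIO(payload)
    batches = read_batches(stream)
    num_indices = read_int(stream)  # EOFError if the writer died early
    order = [read_int(stream) for _ in range(num_indices)]
    return [batches[i] for i in order]
```

Calling `collect(serve([b"a", b"bb", b"ccc"], fail_after=1))` raises `EOFError` with no hint about the underlying failure, while `read_batches` alone on the same payload returns just `[b"a"]`, which corresponds to the silent partition loss mentioned above.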
