[GitHub] [spark] dvogelbacher commented on issue #24677: [SPARK-27805][PYTHON] Propagate SparkExceptions during toPandas with arrow enabled

GitBox Wed, 22 May 2019 11:43:57 -0700

dvogelbacher commented on issue #24677: [SPARK-27805][PYTHON] Propagate 
SparkExceptions during toPandas with arrow enabled
URL: https://github.com/apache/spark/pull/24677#issuecomment-494884105
 
 
   `collectAsArrowToPython` just returns the socket info from 
`PythonRDD.serveToStream("serve-Arrow")`. The exception occurs during the 
`runJob` call inside `serveToStream`, which is executed in a background 
thread. When that background thread encounters an exception, it closes the 
`OutputStream`.
   The `ArrowStreamSerializer` in the Python process then thinks it has read 
all the batches. After that, the `ArrowCollectSerializer` tries to read the 
batch order indices and throws an `EOFError`, as those were never written.
   
   Also note that before https://github.com/apache/spark/pull/22275 (which 
introduced the batch order indices) this would not have resulted in any error 
on the Python side. We would have silently dropped some partitions. Now at 
least we get an error, but it is not a very helpful one.
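
   To make the failure mode concrete, here is a minimal, self-contained 
sketch. It does not use Spark or Arrow: the length-prefixed framing and the 
names `serve`, `read_batches` and `read_collected` are hypothetical stand-ins 
for the JVM writer thread, the `ArrowStreamSerializer` and the 
`ArrowCollectSerializer`. It only assumes, as described above, that the reader 
treats a closed stream as "no more batches".

```python
import io
import struct


def read_int(stream):
    """Read a big-endian 4-byte int; raise EOFError if the stream has ended."""
    data = stream.read(4)
    if len(data) < 4:
        raise EOFError("stream ended before an int could be read")
    return struct.unpack("!i", data)[0]


def serve(batches, order, fail_after=None):
    """Hypothetical writer: length-prefixed (non-empty) batches, a 0 end
    marker, then the number of batch order indices and the indices.
    If fail_after is set, stop after that many batches to simulate the
    background thread closing the stream on an exception."""
    out = io.BytesIO()
    for i, batch in enumerate(batches):
        if fail_after is not None and i >= fail_after:
            return out.getvalue()  # stream closed early, nothing else written
        out.write(struct.pack("!i", len(batch)))
        out.write(batch)
    out.write(struct.pack("!i", 0))           # explicit end-of-batches marker
    out.write(struct.pack("!i", len(order)))  # number of order indices
    for idx in order:
        out.write(struct.pack("!i", idx))
    return out.getvalue()


def read_batches(stream):
    """Hypothetical batch reader: a closed stream looks like a normal end."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # EOF is indistinguishable from "all batches read"
        size = struct.unpack("!i", header)[0]
        if size == 0:
            return  # explicit end marker
        yield stream.read(size)


def read_collected(stream):
    """Hypothetical collect reader: read batches, then the order indices."""
    batches = list(read_batches(stream))
    num_indices = read_int(stream)  # raises EOFError on a truncated stream
    order = [read_int(stream) for _ in range(num_indices)]
    return [batches[i] for i in order]
```

   With this sketch, a complete stream is reordered as expected, a stream 
truncated after the first batch makes `read_collected` raise `EOFError`, and 
calling `read_batches` alone on the truncated stream returns one batch without 
any error, which mirrors the silent partition loss before the batch order 
indices existed.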


