[GitHub] [spark] BryanCutler commented on issue #24734: [SPARK-27870][SQL][PySpark] Flush each batch for pandas UDF (for improving pandas UDFs pipeline)

GitBox Fri, 31 May 2019 16:49:18 -0700

BryanCutler commented on issue #24734: [SPARK-27870][SQL][PySpark] Flush each 
batch for pandas UDF (for improving pandas UDFs pipeline)
URL: https://github.com/apache/spark/pull/24734#issuecomment-497891982
 
 
   @WeichenXu123 this change might improve perf for your case, where a 
pandas_udf takes a massive amount of time on small data, but my concern is that 
it could have a negative effect on other cases where IO is the main bottleneck. 
This PR changes all pandas_udf IO, not just the scalar ones. Could you post a 
simple benchmark script and run some numbers?
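
The trade-off raised above (flushing after every batch versus letting the stream buffer fill) can be illustrated without Spark. The sketch below is not the PR's code and does not touch PySpark; `CountingSink` and `write_batches` are hypothetical names, and the raw sink merely stands in for the stream to the Python worker. It counts how many writes reach the underlying sink in each mode, as a rough proxy for IO overhead when batches are small.

```python
import io


class CountingSink(io.RawIOBase):
    """Hypothetical raw sink: discards bytes, counts writes that reach it."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.nbytes = 0

    def writable(self):
        return True

    def write(self, b):
        n = memoryview(b).nbytes
        self.calls += 1
        self.nbytes += n
        return n  # report the whole chunk as written


def write_batches(num_batches, batch_bytes, flush_each_batch, buffer_size=65536):
    """Write fixed-size batches through a buffered stream.

    Returns (raw_write_calls, total_bytes) seen by the underlying sink.
    """
    sink = CountingSink()
    out = io.BufferedWriter(sink, buffer_size=buffer_size)
    payload = b"x" * batch_bytes
    for _ in range(num_batches):
        out.write(payload)
        if flush_each_batch:
            out.flush()  # push this batch downstream immediately
    out.flush()  # drain whatever is still buffered
    return sink.calls, sink.nbytes


if __name__ == "__main__":
    for flush in (True, False):
        calls, nbytes = write_batches(1000, 100, flush_each_batch=flush)
        print(f"flush_each_batch={flush}: raw writes={calls}, bytes={nbytes}")
```

With small batches, flushing per batch sends many more small writes downstream, which is the IO-bound case the concern is about; a real benchmark would still need to time actual pandas_udf jobs of both kinds.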


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
