[GitHub] [spark] WeichenXu123 commented on issue #24734: [SPARK-27870][SQL][PySpark] Flush batch timely for pandas UDF (for improving pandas UDFs pipeline)

GitBox Thu, 06 Jun 2019 01:10:20 -0700

WeichenXu123 commented on issue #24734: [SPARK-27870][SQL][PySpark] Flush batch 
timely for pandas UDF (for improving pandas UDFs pipeline)
URL: https://github.com/apache/spark/pull/24734#issuecomment-499393942
 
 
   @BryanCutler Yes, I also considered tuning the "buffer size", but one of my 
concerns is that users may mix ML prediction with other simple UDFs (such as 
data preprocessing). For example, data from Spark Streaming might first go 
through a simple UDF and then through a pipelined complex UDF (which does ML 
prediction). If we tune the "buffer size" globally to be small, it will hurt 
the performance of the first, simple UDF.
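
   A rough way to see the concern is the toy model below. It is not Spark code: 
the function name, the 64 MiB workload, and both buffer sizes are made-up 
values for illustration only. It simply counts how many flushes a fixed-size 
buffer needs to drain the same amount of data.

```python
import math


def flush_count(total_bytes: int, buffer_size: int) -> int:
    """Flushes needed for a fixed-size buffer to drain total_bytes (toy model)."""
    if buffer_size <= 0:
        raise ValueError("buffer_size must be positive")
    return math.ceil(total_bytes / buffer_size)


# Hypothetical workload: 64 MiB of serialized rows going through the simple UDF.
TOTAL_BYTES = 64 * 1024 * 1024

# Two hypothetical global settings: a larger buffer and a much smaller one.
large_buffer_flushes = flush_count(TOTAL_BYTES, 64 * 1024)
small_buffer_flushes = flush_count(TOTAL_BYTES, 4 * 1024)

print(large_buffer_flushes, small_buffer_flushes)
```

   With a global setting, the cheap preprocessing UDF pays the extra flushes 
too, even though only the ML prediction UDF benefits from receiving its input 
earlier.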
   @mengxr What do you think?
   
   Thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

