WeichenXu123 commented on issue #24734: [SPARK-27870][SQL][PySpark] Flush batch timely for pandas UDF (for improving pandas UDFs pipeline)
URL: https://github.com/apache/spark/pull/24734#issuecomment-499393942

@BryanCutler Yes, I also considered tuning the "buffer size", but one of my concerns is that users may mix ML prediction with other simple UDFs (such as data preprocessing). For example, data from Spark Streaming might first go through a simple UDF and then through a pipelined complex UDF (which does ML prediction). If we tune the "buffer size" globally to be small, it will hurt the performance of the first, simple UDF.

@mengxr What do you think? Thanks!
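
To make the trade-off concrete, here is a rough toy model in plain Python. It is only a sketch under assumptions and not Spark's actual implementation: it assumes a fixed-size output buffer that flushes only when full (plus once at the end), and `flush_stats`, the payload size, and the two buffer sizes are hypothetical values chosen for illustration.

```python
import math
from typing import Tuple


def flush_stats(total_bytes: int, buffer_size: int) -> Tuple[int, int]:
    """Toy model (assumption): the buffer flushes only when full, plus once at the end.

    Returns (number of flushes, bytes accumulated before the first flush).
    """
    flushes = math.ceil(total_bytes / buffer_size)
    bytes_before_first_flush = min(total_bytes, buffer_size)
    return flushes, bytes_before_first_flush


# Hypothetical workload: 1,000,000 bytes of UDF output.
total = 1_000_000
for size in (65536, 1024):
    flushes, first = flush_stats(total, size)
    print(f"buffer={size:>6}  flushes={flushes:>4}  bytes before first flush={first}")
```

In this model a small buffer lets the downstream ML UDF start on its first batch sooner, but the simple UDF pays for many more flushes on the same data. That is why a single global setting is hard to pick for a mixed pipeline.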
