Hi Python Spark Developers & Users,

As Datasets/DataFrames are becoming the core building block of Spark, and as someone who cares about Python Spark performance, I've been looking more closely at PySpark UDF performance.
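To make the kind of UDF I'm talking about concrete, here's an illustrative sketch (not code from the PR) of a simple tokenization-style UDF body - pure Python with no C-extension dependencies, i.e. the sort of function that could in principle be evaluated in Jython:

```python
# A simple tokenization-style UDF body: pure Python, no C extensions,
# so it's the kind of function Jython could in principle evaluate.
# In PySpark it would be wrapped with pyspark.sql.functions.udf.
def word_count(text):
    return len(text.split())

# Today each row's value crosses the JVM <-> Python boundary; evaluating
# this directly on the JVM via Jython avoids that serialization round trip.
print(word_count("hello spark world"))  # 3
```

(A UDF that imports a C extension like numpy, by contrast, needs CPython and couldn't run under Jython.)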
I've got an early WIP/request-for-comments pull request open <https://github.com/apache/spark/pull/13571>, with a corresponding design document <https://docs.google.com/document/d/1L-F12nVWSLEOW72sqOn6Mt1C0bcPFP9ck7gEMH2_IXE/edit> and JIRA (SPARK-15369) <https://issues.apache.org/jira/browse/SPARK-15369>, that allows for selective UDF evaluation in Jython <http://www.jython.org/>. Now that Spark 2.0.1 is out I'd really love people's input or feedback on this proposal so I can circle back with a more complete PR :) I'd love to hear from people using PySpark whether this is something that looks interesting (as well as from the PySpark developers), along with thoughts on some of the open questions :)

For users: If you have simple Python UDFs (or, even better, UDFs and datasets) that you can share for benchmarking, it would be really useful to add them to the benchmarking I've been looking at in the design doc. It would also be useful to know if some, many, or none of your UDFs can be evaluated by Jython. If you have UDFs you aren't comfortable sharing on-list, feel free to reach out to me directly.

Some general open questions:

1) The draft PR does some magic** to allow being passed functions at least some of the time - is that something people are interested in, or would it be better to leave the magic out and just require a string representing the lambda to be passed in?
2) Would it be useful to provide easy steps to use JyNI <http://jyni.org/>? It's LGPL-licensed <https://www.gnu.org/licenses/lgpl-3.0.en.html>, so I don't think we can include it out of the box <https://www.apache.org/legal/resolved.html#category-x> - but we could try to make it easy for users to link with if it's important.
3) While we have a 2x speedup* for tokenization/wordcount (getting close to native Scala perf) - what is performance like for other workloads (please share your desired UDFs/workloads for my evil benchmarking plans)?
4) What does the eventual Dataset API look like for Python?
(This could partially influence #1.)
5) How important is it not to add the Jython dependencies to the weight for non-Python users (and, if desired, which workaround to choose - maybe something like spark-hive)?
6) Do you often chain PySpark UDF operations, and is that something we should try to optimize for in Jython as well?
7) How many of your Python UDFs can / can not be evaluated in Jython for one reason or another?
8) Do your UDFs depend on Spark accumulators or broadcast values?
9) What am I forgetting in my coffee-fueled happiness?

Cheers,

Holden :)

* Benchmarking has been very limited; the 2~3x improvement is likely different for "real" workloads (unless you really like doing wordcount :p :))
** Note: the magic depends on dill <https://pypi.python.org/pypi/dill>.

P.S. I leave you with this optimistic 80s-style intro screen <https://twitter.com/holdenkarau/status/783762213408497670> :) Also, if anyone happens to be going to PyData DC <http://pydata.org/dc2016/> this weekend, I'd love to chat with you in person about this (and of course circle it back to the mailing list).

--
Cell : 425-233-8271 Twitter: https://twitter.com/holdenkarau