Hi Python Spark Developers & Users,

As Datasets/DataFrames are becoming the core building block of Spark, and
as someone who cares about Python Spark performance, I've been looking more
at PySpark UDF performance.

I've got an early WIP/request for comments pull request open
<https://github.com/apache/spark/pull/13571> with a corresponding design
document
<https://docs.google.com/document/d/1L-F12nVWSLEOW72sqOn6Mt1C0bcPFP9ck7gEMH2_IXE/edit>
and
JIRA (SPARK-15369) <https://issues.apache.org/jira/browse/SPARK-15369> that
allows for selective UDF evaluation in Jython <http://www.jython.org/>. Now
that Spark 2.0.1 is out, I'd really love people's input or feedback on this
proposal so I can circle back with a more complete PR :) I'd especially love
to hear from PySpark users (as well as the PySpark developers) about whether
this looks interesting and about the open questions below :)
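To give a feel for the idea, here's a rough sketch of what using it might
look like. To be clear, this is illustration only - the `jython_udf` name
and signature are placeholders I made up, not necessarily what the PR
exposes:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.appName("jython-udf-sketch").getOrCreate()

    # The core idea: instead of shipping pickled closures to CPython worker
    # processes, a UDF whose body Jython can evaluate runs inside the JVM,
    # skipping the Java<->Python serialization round trip on every row.
    # `jython_udf` is a hypothetical name used for this sketch.
    strlen = jython_udf("lambda s: len(s)", IntegerType())

    df = spark.createDataFrame([("hello",), ("spark",)], ["word"])
    df.select(strlen(df.word)).show()
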

For users: If you have simple Python UDFs (or even better, UDFs and
datasets) that you can share for benchmarking, it would be really useful to
add them to the benchmarks I've been looking at in the design doc. It would
also be useful to know if some, many, or none of your UDFs can be evaluated
by Jython. If you have UDFs you aren't comfortable sharing on-list, feel
free to reach out to me directly.
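To make "can be evaluated by Jython" concrete, here's roughly the dividing
line I have in mind (my own toy examples, not taken from the PR):

    # Pure-Python logic like this tokenizer is the happy case - Jython can
    # evaluate it directly on the JVM.
    def tokenize(text):
        return text.lower().split()

    # But Jython can't load CPython C-extension modules, so a UDF like this
    # numpy-based one couldn't run in Jython and would stay on the existing
    # CPython worker path (or need something like JyNI - see question 2
    # below).
    import numpy as np

    def l2_norm(values):
        return float(np.linalg.norm(values))
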

Some general open questions:

1) The draft PR does some magic** so that regular Python functions can be
passed in, at least some of the time - is that something people are
interested in, or would it be better to leave the magic out and just require
a string representing the lambda to be passed in? (See the sketch after this
list.)

2) Would it be useful to provide easy steps to use JyNI <http://jyni.org/>
(it's LGPL licensed <https://www.gnu.org/licenses/lgpl-3.0.en.html>, so I
don't think we can include it out of the box
<https://www.apache.org/legal/resolved.html#category-x> - but we could try
and make it easy for users to link against if it's important)?

3) While we see a 2x speedup* for tokenization/wordcount (getting close to
native Scala performance) - what is performance like for other workloads?
(Please share your desired UDFs/workloads for my evil benchmarking plans.)

4) What does the eventual Dataset API look like for Python? (This could
partially influence #1.)

5) How important is it to not add the Jython dependencies to the weight for
non-Python users (and, if desired, which workaround to choose - maybe
something like spark-hive)?

6) Do you often chain PySpark UDF operations, and is that something we
should try to optimize for in Jython as well?

7) How many of your Python UDFs can (or cannot) be evaluated in Jython, for
one reason or another?

8) Do your UDFs depend on Spark accumulators or broadcast values?

9) What am I forgetting in my coffee-fueled happiness?
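
To make question 1 concrete, here's a sketch of the two calling conventions
(again, `jython_udf` is just a placeholder name for this illustration):

    from pyspark.sql.types import StringType

    # Option A: no magic - the user hands us the lambda as a string, which
    # can be compiled and evaluated directly by Jython on the JVM side.
    to_upper = jython_udf("lambda s: s.upper()", StringType())

    # Option B: magic** - the user passes a regular Python function and we
    # use dill to recover its source so Jython can re-evaluate it. This is
    # friendlier, but only works when dill can get at the source.
    def to_upper_fn(s):
        return s.upper()

    to_upper = jython_udf(to_upper_fn, StringType())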

Cheers,

Holden :)

* Benchmarking has been very limited; the 2~3X improvement is likely
different for "real" workloads (unless you really like doing wordcount :p :))
** Note: the magic depends on dill <https://pypi.python.org/pypi/dill>.
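
For the curious, the dill piece is roughly source recovery - something along
these lines (a simplified sketch of the mechanism, not code from the PR):

    import dill.source

    def add_one(x):
        return x + 1

    # Recover the function's source text so it can be shipped to, and
    # re-evaluated by, Jython on the JVM side.
    src = dill.source.getsource(add_one)
    print(src)
    # def add_one(x):
    #     return x + 1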

P.S.

I leave you with this optimistic 80s style intro screen
<https://twitter.com/holdenkarau/status/783762213408497670> :)
Also if anyone happens to be going to PyData DC <http://pydata.org/dc2016/>
this weekend I'd love to chat with you in person about this (and of course
circle it back to the mailing list).
-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
