GitHub user holdenk opened a pull request:
https://github.com/apache/spark/pull/13571
[SPARK-15369][WIP][RFC][PySpark][SQL] Expose potential to use Jython for PySpark UDFs
This is an early work-in-progress / RFC PR to gauge interest in, and gather
thoughts on, offering Jython for some PySpark UDF evaluation.
## What changes were proposed in this pull request?
Transferring data from the JVM to the Python executor can be a substantial
bottleneck. While Jython is not suitable for all UDFs or map functions, it may
be suitable for some simple ones. An early draft of this, benchmarked with a
tokenization UDF, found the Jython UDF to be ~65% faster than a regular Python
UDF and ~2% slower than a native Scala UDF across multiple runs. The first run
with a Jython UDF includes the cost of starting the Jython interpreter on the
workers, but even in that case it outperforms regular PySpark UDFs by ~20%.
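For illustration, the kind of "simple" UDF this targets might look like the sketch below. It is not the benchmark code from this PR: the names `tokenize` and `make_python_udf` are hypothetical, and only the existing CPython UDF path (`pyspark.sql.functions.udf`) is shown, since the Jython registration API is not described in this message. The function sticks to the standard library, which is the sort of code that would be a plausible candidate for Jython.

```python
import re

# Hypothetical example: a whitespace tokenizer using only the standard library.
_WHITESPACE = re.compile(r"\s+")


def tokenize(text):
    """Split a line of text into whitespace-delimited tokens."""
    if text is None:
        return []
    stripped = text.strip()
    if not stripped:
        return []
    return _WHITESPACE.split(stripped)


def make_python_udf():
    """Wrap tokenize as a regular (CPython) PySpark UDF.

    PySpark is imported lazily so the tokenizer can be used without Spark.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    return udf(tokenize, ArrayType(StringType()))
```

With the regular path above, each row is serialized out to a Python worker and back, which is the transfer cost described earlier.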
## How was this patch tested?
Unit tests, doctests, and a benchmark (see
https://docs.google.com/document/d/1L-F12nVWSLEOW72sqOn6Mt1C0bcPFP9ck7gEMH2_IXE/edit?usp=sharing).
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/holdenk/spark SPARK-15369-investigate-selectively-using-jython
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13571.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13571
----
commit 09d0d5cc597bb76b5deee91e7aa1705b5e0281f9
Author: Holden Karau <[email protected]>
Date: 2016-05-20T18:57:53Z
Start work on Jython UDF support
commit 64954e4f1540fe8f82aa6f6dfbb5871072aaf4e6
Author: Holden Karau <[email protected]>
Date: 2016-05-20T22:01:08Z
More work on the calling it from Python side
commit b8201358dc5274c2567843d5cb5d1d6f6a086925
Author: Holden Karau <[email protected]>
Date: 2016-05-20T23:10:53Z
Ok the basics work but maybe not to use reflection in base for int/long
commit f6462e376646278426e286251cbe0b1389623d5b
Author: Holden Karau <[email protected]>
Date: 2016-05-21T00:48:35Z
Ok it now works for single elem inputs and integer/array of string returns
commit 0c16ff3116116cbdf8e74d51ecf6f75cdf266960
Author: Holden Karau <[email protected]>
Date: 2016-05-21T02:16:54Z
Take zero to 2 arguments
commit 0a74efcfa1c3f804bce907544b470c253b5b9ca8
Author: Holden Karau <[email protected]>
Date: 2016-05-21T02:23:03Z
PyLint and expand a bit on the error cases
commit 21e9f4ec3c0c8a24220facbe52a1d21c42c1e531
Author: Holden Karau <[email protected]>
Date: 2016-05-21T06:02:16Z
Switch from json back to pickle
commit 4d726474ae3ae81f1f65101369949de25ce6e375
Author: Holden Karau <[email protected]>
Date: 2016-05-21T07:14:26Z
Reeeeealllllly sketchy Row-ish-support-ish
commit c9782154d5d3456f66b3e7b8d6aa57e2c6bbee4f
Author: Holden Karau <[email protected]>
Date: 2016-05-21T07:16:44Z
Style fixes
commit fed0beb72d68454a28400eb90964fe7968a9ee4d
Author: Holden Karau <[email protected]>
Date: 2016-05-22T02:39:12Z
Use generic Row
commit d530712d369b72ec9c01cf2c14690615695bcd9e
Author: Holden Karau <[email protected]>
Date: 2016-05-22T03:21:47Z
Start on a bit of ScalaDoc and mark classes as private
commit 68ba3b826ab5188402cd7cb080386af0cd49638a
Author: Holden Karau <[email protected]>
Date: 2016-05-22T03:22:11Z
Remove debug prints
commit b1b39bbbe29c409fd5f910aa4e5252bbb5a4bf27
Author: Holden Karau <[email protected]>
Date: 2016-05-22T03:26:39Z
Doc params
commit 27882857c57f2a7a1d224cd49257646b7f5099a7
Author: Holden Karau <[email protected]>
Date: 2016-05-22T04:12:40Z
Start adding some tests
commit 6e964307d052482e702c98f2cb3cdde11e75bf83
Author: Holden Karau <[email protected]>
Date: 2016-05-22T04:29:29Z
Remove some ignores
commit b6f4aa37d18004e871b82dc7c5aff72f0902498e
Author: Holden Karau <[email protected]>
Date: 2016-05-22T04:58:59Z
Merge branch 'master' into SPARK-15369-investigate-selectively-using-jython
commit 87900b4a1432bd0a1606812777c01b4945cceb78
Author: Holden Karau <[email protected]>
Date: 2016-05-22T07:58:07Z
python3 compatibility (yay), also doctests + dill don't play super well
together so limit the doctests. TODO: copy more tests in tests.py and update
docstrings and doctests to be uniform and more clear about when/when not jython
will probably work. Also consider porting wordcount example to jython
commit bd00c6c182dfa6d6ae5fb90b9f230b0a14051066
Author: Holden Karau <[email protected]>
Date: 2016-05-22T23:03:51Z
Start adding tests for jython functionality (broken)
commit 9e173d6463b89c1f7a7ddaff9f52ba48330f7b26
Author: Holden Karau <[email protected]>
Date: 2016-05-23T20:09:55Z
PySpark tests
commit 764929e092c665de69d6fe84bcbc39743feb2cb5
Author: Holden Karau <[email protected]>
Date: 2016-05-23T22:32:51Z
Update the tests, seems to work in py2 - need to fix issue with skipping
dill tests when dill is missing
commit e7bf7be0a669f958602b9b91c44ab4e7eaf2e3e6
Author: Holden Karau <[email protected]>
Date: 2016-05-23T23:31:27Z
Merge branch 'master' into SPARK-15369-investigate-selectively-using-jython
commit a404a5aad9ab38290de5a0f92c8b007f2167a645
Author: Holden Karau <[email protected]>
Date: 2016-05-24T00:09:19Z
Skip on dill not being available
commit ee57eefb866e83f14737ac9b35fc20a33bcb2ff7
Author: Holden Karau <[email protected]>
Date: 2016-05-24T03:48:51Z
Handle closure arguments (aww yeah) and make the tests pass (py2 w/dill,
py3 w/o dill)
commit be55dedf20b23acb065fcb33701b2ba99b66eb90
Author: Holden Karau <[email protected]>
Date: 2016-05-24T04:50:44Z
Support python 2 and 3 closures
commit 6e696288dbfb741e4a2c43697f8ab97da8c6afdd
Author: Holden Karau <[email protected]>
Date: 2016-05-24T18:42:11Z
Merge branch 'master' into SPARK-15369-investigate-selectively-using-jython
commit 7055e664865d2094d8b68b3743e8242a9513c5fb
Author: Holden Karau <[email protected]>
Date: 2016-05-25T00:06:09Z
broadcast the LazyJythonFunc
commit aacc3118243cd2c5dc1a6958288dca84ecdb4e41
Author: Holden Karau <[email protected]>
Date: 2016-05-25T01:01:03Z
Refactor a bit to simplify the imports/vars/setup code and allow skipping.
Also cleanup broadcast on python object delete
commit b4a8e220a86cd1ff2f15b35fffbe5939c24b52b6
Author: Holden Karau <[email protected]>
Date: 2016-05-25T01:22:42Z
pep8 fixes
commit c84aca613db09491f7b73afd03c83e2b67a88498
Author: Holden Karau <[email protected]>
Date: 2016-05-25T01:22:53Z
Start adding sql udf perf
commit 80507b108e514828323420e531400ace2a5bfe5e
Author: Holden Karau <[email protected]>
Date: 2016-05-25T01:24:59Z
pep8ify the new example
----