GitHub user holdenk opened a pull request:

    https://github.com/apache/spark/pull/13571

    [SPARK-15369][WIP][RFC][PySpark][SQL] Expose potential to use Jython for 
PySpark UDFs

    This is an early work-in-progress / RFC PR to gauge interest in, and gather 
thoughts on, offering Jython for some PySpark UDF evaluation.
    
    ## What changes were proposed in this pull request?
    
    Transferring data from the JVM to the Python executor can be a substantial 
bottleneck. While Jython is not suitable for all UDFs or map functions, it may 
be suitable for some simple ones. In an early draft of this change, benchmarked 
with a tokenization UDF, the Jython UDF was ~65% faster than the regular Python 
UDF and ~2% slower than a native Scala UDF across multiple runs. The first run 
with a Jython UDF also pays the cost of starting the Jython interpreter on the 
workers, but even then it outperforms a regular PySpark UDF by ~20%.
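
For illustration only (this is not code from the PR or its benchmark), a minimal sketch of the kind of "simple" tokenization function discussed above. The name `tokenize` and the regular expression are assumptions chosen for the example. The function uses only the standard library and plain strings and lists, which is the sort of UDF body that could plausibly run unchanged under either CPython or Jython.

```python
import re

# Hypothetical tokenizer, for illustration only; the benchmark in the linked
# document may use a different function.
_TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Split a string into lowercase word tokens; None yields an empty list."""
    if text is None:
        return []
    return _TOKEN_RE.findall(text.lower())


# With PySpark installed, the same function can be wrapped as a regular
# (CPython-evaluated) UDF, which is the baseline the numbers above compare
# against:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import ArrayType, StringType
#   tokenize_udf = udf(tokenize, ArrayType(StringType()))
#   df.select(tokenize_udf(df["text"]))   # df and "text" are assumed names
```

A regular PySpark UDF like the commented-out one serializes each input value from the JVM to a Python worker process and back, which is the transfer cost described above.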
    
    
    ## How was this patch tested?
    
    Unit tests, doc tests, and a benchmark (see 
https://docs.google.com/document/d/1L-F12nVWSLEOW72sqOn6Mt1C0bcPFP9ck7gEMH2_IXE/edit?usp=sharing
).


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/holdenk/spark SPARK-15369-investigate-selectively-using-jython

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13571.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13571
    
----
commit 09d0d5cc597bb76b5deee91e7aa1705b5e0281f9
Author: Holden Karau <[email protected]>
Date:   2016-05-20T18:57:53Z

    Start work on Jython UDF support

commit 64954e4f1540fe8f82aa6f6dfbb5871072aaf4e6
Author: Holden Karau <[email protected]>
Date:   2016-05-20T22:01:08Z

    More work on the calling it from Python side

commit b8201358dc5274c2567843d5cb5d1d6f6a086925
Author: Holden Karau <[email protected]>
Date:   2016-05-20T23:10:53Z

    Ok the basics work but maybe not to use reflection in base for int/long

commit f6462e376646278426e286251cbe0b1389623d5b
Author: Holden Karau <[email protected]>
Date:   2016-05-21T00:48:35Z

    Ok it now works for single elem inputs and integer/array of string returns

commit 0c16ff3116116cbdf8e74d51ecf6f75cdf266960
Author: Holden Karau <[email protected]>
Date:   2016-05-21T02:16:54Z

    Take zero to 2 arguments

commit 0a74efcfa1c3f804bce907544b470c253b5b9ca8
Author: Holden Karau <[email protected]>
Date:   2016-05-21T02:23:03Z

    PyLint and expand a bit on the error cases

commit 21e9f4ec3c0c8a24220facbe52a1d21c42c1e531
Author: Holden Karau <[email protected]>
Date:   2016-05-21T06:02:16Z

    Switch from json back to pickle

commit 4d726474ae3ae81f1f65101369949de25ce6e375
Author: Holden Karau <[email protected]>
Date:   2016-05-21T07:14:26Z

    Reeeeealllllly sketchy Row-ish-support-ish

commit c9782154d5d3456f66b3e7b8d6aa57e2c6bbee4f
Author: Holden Karau <[email protected]>
Date:   2016-05-21T07:16:44Z

    Style fixes

commit fed0beb72d68454a28400eb90964fe7968a9ee4d
Author: Holden Karau <[email protected]>
Date:   2016-05-22T02:39:12Z

    Use generic Row

commit d530712d369b72ec9c01cf2c14690615695bcd9e
Author: Holden Karau <[email protected]>
Date:   2016-05-22T03:21:47Z

    Start on a bit of ScalaDoc and mark classes as private

commit 68ba3b826ab5188402cd7cb080386af0cd49638a
Author: Holden Karau <[email protected]>
Date:   2016-05-22T03:22:11Z

    Remove debug prints

commit b1b39bbbe29c409fd5f910aa4e5252bbb5a4bf27
Author: Holden Karau <[email protected]>
Date:   2016-05-22T03:26:39Z

    Doc params

commit 27882857c57f2a7a1d224cd49257646b7f5099a7
Author: Holden Karau <[email protected]>
Date:   2016-05-22T04:12:40Z

    Start adding some tests

commit 6e964307d052482e702c98f2cb3cdde11e75bf83
Author: Holden Karau <[email protected]>
Date:   2016-05-22T04:29:29Z

    Remove some ignores

commit b6f4aa37d18004e871b82dc7c5aff72f0902498e
Author: Holden Karau <[email protected]>
Date:   2016-05-22T04:58:59Z

    Merge branch 'master' into SPARK-15369-investigate-selectively-using-jython

commit 87900b4a1432bd0a1606812777c01b4945cceb78
Author: Holden Karau <[email protected]>
Date:   2016-05-22T07:58:07Z

    python3 compatibility (yay), also doctests + dill don't play super well 
together so limit the doctests. TODO: copy more tests in tests.py and update 
docstrings and doctests to be uniform and more clear about when/when not jython 
will probably work. Also consider porting wordcount example to jython

commit bd00c6c182dfa6d6ae5fb90b9f230b0a14051066
Author: Holden Karau <[email protected]>
Date:   2016-05-22T23:03:51Z

    Start adding tests for jython functionality (broken)

commit 9e173d6463b89c1f7a7ddaff9f52ba48330f7b26
Author: Holden Karau <[email protected]>
Date:   2016-05-23T20:09:55Z

    PySpark tests

commit 764929e092c665de69d6fe84bcbc39743feb2cb5
Author: Holden Karau <[email protected]>
Date:   2016-05-23T22:32:51Z

    Update the tests, seems to work in py2 - need to fix issue with skipping 
dill tests when dill is missing

commit e7bf7be0a669f958602b9b91c44ab4e7eaf2e3e6
Author: Holden Karau <[email protected]>
Date:   2016-05-23T23:31:27Z

    Merge branch 'master' into SPARK-15369-investigate-selectively-using-jython

commit a404a5aad9ab38290de5a0f92c8b007f2167a645
Author: Holden Karau <[email protected]>
Date:   2016-05-24T00:09:19Z

    Skip on dill not being available

commit ee57eefb866e83f14737ac9b35fc20a33bcb2ff7
Author: Holden Karau <[email protected]>
Date:   2016-05-24T03:48:51Z

    Handle closure arguments (aww yeah) and make the tests pass (py2 w/dill , 
py3 w/o dill)

commit be55dedf20b23acb065fcb33701b2ba99b66eb90
Author: Holden Karau <[email protected]>
Date:   2016-05-24T04:50:44Z

    Support python 2 and 3 closures

commit 6e696288dbfb741e4a2c43697f8ab97da8c6afdd
Author: Holden Karau <[email protected]>
Date:   2016-05-24T18:42:11Z

    Merge branch 'master' into SPARK-15369-investigate-selectively-using-jython

commit 7055e664865d2094d8b68b3743e8242a9513c5fb
Author: Holden Karau <[email protected]>
Date:   2016-05-25T00:06:09Z

    broadcast the LazyJythonFunc

commit aacc3118243cd2c5dc1a6958288dca84ecdb4e41
Author: Holden Karau <[email protected]>
Date:   2016-05-25T01:01:03Z

    Refactor a bit to simplify the imports/vars/setup code and allow skipping. 
Also cleanup broadcast on python object delete

commit b4a8e220a86cd1ff2f15b35fffbe5939c24b52b6
Author: Holden Karau <[email protected]>
Date:   2016-05-25T01:22:42Z

    pep8 fixes

commit c84aca613db09491f7b73afd03c83e2b67a88498
Author: Holden Karau <[email protected]>
Date:   2016-05-25T01:22:53Z

    Start adding sql udf perf

commit 80507b108e514828323420e531400ace2a5bfe5e
Author: Holden Karau <[email protected]>
Date:   2016-05-25T01:24:59Z

    pep8ify the new example

----



