GitHub user holdenk opened a pull request:
https://github.com/apache/spark/pull/15659
[WIP][SPARK-1267][SPARK-18129] Allow PySpark to be pip installed
## What changes were proposed in this pull request?
This PR aims to provide a pip-installable PySpark package. It does a fair
amount of work to copy the JARs over and package them with the Python code (to
avoid the problems that come from mixing one version of the Python code with a
different version of the JARs). It does not currently publish to PyPI, but
that is the natural follow-up (SPARK-18129).
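As a rough sketch of the packaging approach (not the exact contents of this
PR's setup.py; the deps/ layout, version string, and extras below are
placeholder assumptions), extra pyspark.bin and pyspark.jars packages map onto
symlinked directories so the JARs and scripts ship inside the sdist:

```python
# Simplified sketch of a python/setup.py along these lines; the deps/ layout,
# version string, and extras are assumptions, not this PR's exact contents.
from setuptools import setup

setup(
    name='pyspark',
    version='2.1.0.dev0',  # placeholder; see the dev/snapshot-versioning TODO
    packages=['pyspark', 'pyspark.bin', 'pyspark.jars'],
    # Map the helper packages onto the (symlinked) script and jar directories
    # so the JARs travel inside the package alongside the Python code.
    package_dir={'pyspark.bin': 'deps/bin', 'pyspark.jars': 'deps/jars'},
    package_data={'pyspark.bin': ['*'], 'pyspark.jars': ['*.jar']},
    install_requires=['py4j==0.10.4'],
    extras_require={'ml': ['numpy>=1.7'], 'mllib': ['numpy>=1.7']},
)
```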
Done:
* pip installable on conda [manually tested]
* installed via setup.py on a non-pip-managed system (RHEL) with YARN
[manually tested]
* Automated testing of this (virtualenv)
* packaging and signing with release-build*
Possible follow up work:
* release-build update to publish to PyPI (SPARK-18129)
- figure out who owns the pyspark package name on prod PyPI (is it someone
within the project, should we ask PyPI, or should we choose a different name
to publish under, like ApachePySpark?)
* Windows support and/or testing (SPARK-18136)
* investigate the details of wheel caching and see if we can avoid cleaning the
wheel cache during our tests
* consider how we want to number our dev/snapshot versions
Explicitly out of scope:
* Using pip installed PySpark to start a standalone cluster
* Using pip installed PySpark for non-Python Spark programs
*I've done some work to test release-build locally, but as a non-committer
I've only been able to do local testing.
## How was this patch tested?
Automated testing with virtualenv, manual testing with conda, a system-wide
install, and YARN integration.
The release-build changes were tested locally as a non-committer (no testing
of uploading artifacts to Apache staging websites).
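For flavor, the virtualenv-based check can be approximated like this (a hedged
sketch only; the real automation is a dev/ script, and every path below is an
assumption):

```python
# Hedged sketch of a pip-installability smoke test; assumes a Unix layout and
# that `python setup.py sdist` has already been run under python/.
import glob
import subprocess
import tempfile
import venv

with tempfile.TemporaryDirectory() as env_dir:
    venv.create(env_dir, with_pip=True)
    pip = f"{env_dir}/bin/pip"
    sdist = glob.glob("python/dist/pyspark-*.tar.gz")[0]
    subprocess.check_call([pip, "install", sdist])
    # If the JARs were packaged correctly, importing pyspark should succeed.
    subprocess.check_call([f"{env_dir}/bin/python", "-c",
                           "import pyspark; print(pyspark.__file__)"])
```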
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/holdenk/spark SPARK-1267-pip-install-pyspark
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15659.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15659
----
commit 7763f3c6d28a3246b40a849150746a220e03a112
Author: Juliet Hougland <[email protected]>
Date: 2016-04-14T14:11:37Z
Adds setup.py
commit 30debc7e6fa3a502d7991d2dee9cf48a69d92168
Author: Juliet Hougland <[email protected]>
Date: 2016-04-14T16:31:01Z
Fix spacing.
commit 5155531fce49a0915d6a2187d9adaffc70bfa3f3
Author: Juliet Hougland <[email protected]>
Date: 2016-10-12T05:54:36Z
Update py4j dependency. Add mllib to extras_require, fix some indentation.
commit 2f0bf9b89db9a3a9362b73f2130a2c779fb01a76
Author: Juliet Hougland <[email protected]>
Date: 2016-10-12T06:03:22Z
Adds MANIFEST.in file.
commit 4c00b989c27bfe883775677cd1d8dfb930c42a51
Author: Holden Karau <[email protected]>
Date: 2016-10-12T16:44:16Z
Merge branch 'master' into SPARK-1267-pip-install-pyspark
commit 7ff8d0f465360463d1cd3b503d1d5d8aded7e88f
Author: Holden Karau <[email protected]>
Date: 2016-10-12T17:02:53Z
Start working towards post-2.0 pip-installable PySpark (so including the list
of jars, fix extras_require decl, etc.)
commit 610b9752d33a37c261327536bb581bef20d46fd1
Author: Holden Karau <[email protected]>
Date: 2016-10-16T18:09:17Z
Merge branch 'master' into SPARK-1267-pip-install-pyspark
commit cb2e06d2e31e113dc29f5212fc9e05ba7d87fa8d
Author: Holden Karau <[email protected]>
Date: 2016-10-16T18:47:52Z
MANIFEST and setup can't refer to things above the root of the project,
so create symlinks so we can package the JARs with it
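(A hedged illustration of the symlink-farm idea from this commit; the source
and target paths are assumptions:)

```python
# Illustration only: symlink the built JARs into the python/ tree so that
# setup.py and MANIFEST.in can reference them. Paths are assumptions.
import os

JARS_SOURCE = os.path.abspath("../assembly/target/scala-2.11/jars")
JARS_TARGET = "deps/jars"

os.makedirs("deps", exist_ok=True)
if not os.path.islink(JARS_TARGET):
    os.symlink(JARS_SOURCE, JARS_TARGET)
```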
commit 01f791db9c10378e01321855d33047785ef643b6
Author: Holden Karau <[email protected]>
Date: 2016-10-18T15:47:04Z
Merge branch 'master' into SPARK-1267-pip-install-pyspark
commit e2e4d1c9f42522db6ec981e6d650855a58150897
Author: Holden Karau <[email protected]>
Date: 2016-10-18T16:14:48Z
Keep the symlink
commit fb15d7e3e6b3be7c8c69d776649f4d556656f3f0
Author: Holden Karau <[email protected]>
Date: 2016-10-18T17:38:40Z
Some progress; we need to use sdist, but that is OK
commit aab7ee4fcd3bb4825a91f5c5a9baace9944c68d0
Author: Holden Karau <[email protected]>
Date: 2016-10-18T20:47:14Z
Reenable cleanup
commit 5a5762001946959fbcc96f8daf1510166ad5665e
Author: Holden Karau <[email protected]>
Date: 2016-10-19T14:13:50Z
Try and provide a clear error message when pip installed directly, fix
symlink farm issue, fix scripts issue, TODO: fix SPARK_HOME and find out why
JARs aren't ending up in the install
commit 646aa231cc8646b7bde3ec0df455bd64ec48eb00
Author: Holden Karau <[email protected]>
Date: 2016-10-19T22:56:01Z
Add two scripts
commit 36c9d45e741929d301ef54dadf33ae56a464f479
Author: Holden Karau <[email protected]>
Date: 2016-10-19T23:45:18Z
package_data doesn't work so well with nested directories, so instead add
pyspark.bin and pyspark.jars packages and set their package dirs as desired;
make the spark scripts check whether they are in a pip-installed environment,
and if SPARK_HOME is unset then resolve it with Python [otherwise use the
current behaviour]
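(Sketching the SPARK_HOME resolution this commit describes; simplified from
what later became find_spark_home.py below, and the details here are
assumptions:)

```python
# Simplified sketch of resolving SPARK_HOME for a pip-installed PySpark.
import os

def _find_spark_home():
    """Prefer an explicit SPARK_HOME; otherwise look next to the package."""
    if "SPARK_HOME" in os.environ:
        return os.environ["SPARK_HOME"]
    import pyspark
    # In a pip install the jars/ directory sits beside the Python code,
    # so its presence (rather than a RELEASE file) marks a valid home.
    candidate = os.path.dirname(os.path.realpath(pyspark.__file__))
    if os.path.isdir(os.path.join(candidate, "jars")):
        return candidate
    raise ValueError("Could not find a valid SPARK_HOME")
```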
commit a78754b778c28fe406ac8c60ede7dbea076a19a1
Author: Holden Karau <[email protected]>
Date: 2016-10-20T00:07:15Z
Use copyfile; also check for the jars dir
commit 955e92b556b2af3f22acd78e8b800a44d900cb31
Author: Holden Karau <[email protected]>
Date: 2016-10-20T00:17:26Z
Check if pip installed when finding the shell file
commit 2d88a40c3c6236715b9fbe3af49dafb0999ccf00
Author: Holden Karau <[email protected]>
Date: 2016-10-20T00:19:40Z
Check if jars dir exists rather than release file
commit 9e5c5328e42a462b0f76a2ebad989dfa5b5dcdd5
Author: Holden Karau <[email protected]>
Date: 2016-10-23T15:52:48Z
Start working a bit on the docs
commit be7eadd1af3bc26e952f732d1fb4433bc6dd94e3
Author: Holden Karau <[email protected]>
Date: 2016-10-23T15:53:27Z
Merge branch 'master' into SPARK-1267-pip-install-pyspark
commit 07d384982caa069e96cc2ac64b9faa9dc19ddc00
Author: Holden Karau <[email protected]>
Date: 2016-10-23T21:22:59Z
Try and include pyspark zip file for yarn use
commit 11b5fa85cbaed0866455a28e88f7868428c36219
Author: Holden Karau <[email protected]>
Date: 2016-10-23T23:46:28Z
Copy pyspark zip for use in yarn cluster mode
commit 8791f829469f163ff195647d6250bee6f53d0dc4
Author: Holden Karau <[email protected]>
Date: 2016-10-24T12:56:06Z
Start adding scripts to test pip installability
commit 92837a3a561cf96746c795d11aa60c2e82e6fa2d
Author: Holden Karau <[email protected]>
Date: 2016-10-24T13:40:05Z
Works on YARN, works with spark-submit; still need to fix the import-based
Spark home finder
commit 6947a855f5567eba80b6c3a9cfe97a3fc53fe863
Author: Holden Karau <[email protected]>
Date: 2016-10-24T14:00:00Z
Start updating find-spark-home to be available in many cases.
commit 944160cabbaa96ed00a3d6ff4b7ddff9d29d204a
Author: Holden Karau <[email protected]>
Date: 2016-10-24T14:08:51Z
Switch to find_spark_home.py
commit 5bf0746dea5db4421a6ae8edc96de6d567f460e3
Author: Holden Karau <[email protected]>
Date: 2016-10-24T14:09:03Z
Move to under pyspark
commit 435f8427a6ca5bdfae25ba439822e44b7fd4eff4
Author: Holden Karau <[email protected]>
Date: 2016-10-24T14:13:12Z
Update to py4j 0.10.4 in the deps, also switch how we are copying
find_spark_home.py around
commit 27ca27eda451cc4edbdb1811bef4c07bdafc98ef
Author: Holden Karau <[email protected]>
Date: 2016-10-24T14:16:59Z
Update java gateway to use _find_spark_home function, add quick sanity
check file
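(And a short hedged sketch of how the gateway side might consume the helper:)

```python
# Hedged usage sketch: resolve SPARK_HOME via the helper before launching
# bin/spark-submit; the import path is an assumption for this WIP branch.
import os
from pyspark.find_spark_home import _find_spark_home

spark_home = _find_spark_home()
spark_submit = os.path.join(spark_home, "bin", "spark-submit")
```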
commit df126cf219b9367792e9a25b7d3493b7a060daee
Author: Holden Karau <[email protected]>
Date: 2016-10-24T14:45:23Z
Lint fixes
----